[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518707#comment-14518707 ] Rick Branson commented on CASSANDRA-7486: - I think it definitely makes sense as a default. My guess is that it'll result in fewer headaches for most people. > Compare CMS and G1 pause times > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: Test > Components: Config >Reporter: Jonathan Ellis >Assignee: Shawn Kumar > Fix For: 2.1.5 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
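For anyone wanting rough numbers from a running node, here is a minimal sketch of pause accounting via the standard java.lang.management beans (my own illustration of one way to sample collector time, not how the comparison in the ticket is being run). The same loop reports ParNew/ConcurrentMarkSweep under CMS and "G1 Young Generation"/"G1 Old Generation" under G1; for CMS the old-generation figure includes concurrent phases, so treat it as a rough proxy for pause impact rather than exact pause time.

{code}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Periodically prints cumulative collection count/time per collector so runs with
// -XX:+UseConcMarkSweepGC and -XX:+UseG1GC can be compared side by side.
public class GcPauseReport
{
    public static void main(String[] args) throws InterruptedException
    {
        while (true)
        {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            {
                long count = gc.getCollectionCount();
                long totalMs = gc.getCollectionTime();
                double avgMs = count > 0 ? (double) totalMs / count : 0.0;
                System.out.printf("%-25s collections=%d totalMs=%d avgMs=%.1f%n",
                                  gc.getName(), count, totalMs, avgMs);
            }
            Thread.sleep(10000); // sample every 10 seconds
        }
    }
}
{code}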
[jira] [Commented] (CASSANDRA-9258) Range movement causes CPU & performance impact
[ https://issues.apache.org/jira/browse/CASSANDRA-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518182#comment-14518182 ] Rick Branson commented on CASSANDRA-9258: - This is also an issue for large 2.0 clusters. The TokenMetadata path is identical between 2.1 and 2.0. > Range movement causes CPU & performance impact > -- > > Key: CASSANDRA-9258 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9258 > Project: Cassandra > Issue Type: Bug > Environment: Cassandra 2.1.4 >Reporter: Rick Branson > Fix For: 2.1.6 > > > Observing big CPU & latency regressions when doing range movements on > clusters with many tens of thousands of vnodes. See CPU usage increase by > ~80% when a single node is being replaced. > Top methods are: > 1) Ljava/math/BigInteger;.compareTo in > Lorg/apache/cassandra/dht/ComparableObjectToken;.compareTo > 2) Lcom/google/common/collect/AbstractMapBasedMultimap;.wrapCollection in > Lcom/google/common/collect/AbstractMapBasedMultimap$AsMap$AsMapIterator;.next > 3) Lorg/apache/cassandra/db/DecoratedKey;.compareTo in > Lorg/apache/cassandra/dht/Range;.contains > Here's a sample stack from a thread dump: > {code} > "Thrift:50673" daemon prio=10 tid=0x7f2f20164800 nid=0x3a04af runnable > [0x7f2d878d] >java.lang.Thread.State: RUNNABLE > at org.apache.cassandra.dht.Range.isWrapAround(Range.java:260) > at org.apache.cassandra.dht.Range.contains(Range.java:51) > at org.apache.cassandra.dht.Range.contains(Range.java:110) > at > org.apache.cassandra.locator.TokenMetadata.pendingEndpointsFor(TokenMetadata.java:916) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:775) > at > org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:541) > at > org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:616) > at > org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1101) > at > org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1083) > at > org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:976) > at > org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3996) > at > org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3980) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745){code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9258) Range movement causes CPU & performance impact
Rick Branson created CASSANDRA-9258: --- Summary: Range movement causes CPU & performance impact Key: CASSANDRA-9258 URL: https://issues.apache.org/jira/browse/CASSANDRA-9258 Project: Cassandra Issue Type: Bug Environment: Cassandra 2.1.4 Reporter: Rick Branson Observing big CPU & latency regressions when doing range movements on clusters with many tens of thousands of vnodes. See CPU usage increase by ~80% when a single node is being replaced. Top methods are: 1) Ljava/math/BigInteger;.compareTo in Lorg/apache/cassandra/dht/ComparableObjectToken;.compareTo 2) Lcom/google/common/collect/AbstractMapBasedMultimap;.wrapCollection in Lcom/google/common/collect/AbstractMapBasedMultimap$AsMap$AsMapIterator;.next 3) Lorg/apache/cassandra/db/DecoratedKey;.compareTo in Lorg/apache/cassandra/dht/Range;.contains Here's a sample stack from a thread dump: "Thrift:50673" daemon prio=10 tid=0x7f2f20164800 nid=0x3a04af runnable [0x7f2d878d] java.lang.Thread.State: RUNNABLE at org.apache.cassandra.dht.Range.isWrapAround(Range.java:260) at org.apache.cassandra.dht.Range.contains(Range.java:51) at org.apache.cassandra.dht.Range.contains(Range.java:110) at org.apache.cassandra.locator.TokenMetadata.pendingEndpointsFor(TokenMetadata.java:916) at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:775) at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:541) at org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:616) at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1101) at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:1083) at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:976) at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3996) at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3980) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:205) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
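The expensive shape here is a per-mutation linear scan of every pending range (Range.contains on BigInteger tokens, as in the stack above). Below is a minimal sketch of that shape and of an indexed alternative; the token/range stand-ins are simplified types of my own, not the real TokenMetadata code, and the non-overlapping, non-wrapping ranges are an assumption made purely for illustration.

{code}
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-ins for Token/Range -- not the real Cassandra classes.
public class PendingRangeLookup
{
    // Non-wrapping range (left, right]
    static final class Range
    {
        final BigInteger left, right;
        Range(BigInteger left, BigInteger right) { this.left = left; this.right = right; }
        boolean contains(BigInteger token)
        {
            return left.compareTo(token) < 0 && token.compareTo(right) <= 0;
        }
    }

    // Shape of the hot path in the stack above: scan every pending range per write.
    static List<Range> linearScan(List<Range> pending, BigInteger token)
    {
        List<Range> hits = new ArrayList<Range>();
        for (Range r : pending)            // O(number of pending ranges), per mutation
            if (r.contains(token))
                hits.add(r);
        return hits;
    }

    // Alternative shape: index non-overlapping ranges by their right bound so a
    // single ceilingEntry() lookup replaces the scan. Illustration only.
    static Range indexedLookup(TreeMap<BigInteger, Range> byRight, BigInteger token)
    {
        Map.Entry<BigInteger, Range> e = byRight.ceilingEntry(token);
        return (e != null && e.getValue().contains(token)) ? e.getValue() : null;
    }

    public static void main(String[] args)
    {
        List<Range> pending = new ArrayList<Range>();
        TreeMap<BigInteger, Range> byRight = new TreeMap<BigInteger, Range>();
        for (int i = 0; i < 50000; i++)    // "many tens of thousands of vnodes"
        {
            Range r = new Range(BigInteger.valueOf(i * 100L), BigInteger.valueOf(i * 100L + 100L));
            pending.add(r);
            byRight.put(r.right, r);
        }
        BigInteger token = BigInteger.valueOf(4999950L);
        System.out.println(linearScan(pending, token).size());     // scans all 50,000 ranges
        System.out.println(indexedLookup(byRight, token) != null); // one tree lookup
    }
}
{code}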
[jira] [Commented] (CASSANDRA-9244) replace_address is not topology-aware
[ https://issues.apache.org/jira/browse/CASSANDRA-9244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512756#comment-14512756 ] Rick Branson commented on CASSANDRA-9244: - GossipingPropertyFileSnitch. The scenario is: say you have NTS with RF=3, 10 racks A-J, and range R1, which is spread across nodes A1, C1, and E1 in racks A, C, and E respectively. If you replace_address C1 with node E2, it will cause range R1 to shift to A1, E2, and some node in neither rack A nor rack E. A similar range change would occur if you bootstrapped with tokens-1, but the shifts work properly in that case. > replace_address is not topology-aware > - > > Key: CASSANDRA-9244 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9244 > Project: Cassandra > Issue Type: Bug > Environment: 2.0.12 >Reporter: Rick Branson > > Replaced a node with one in another rack (using replace_address) and it > caused improper distribution after the bootstrap was finished. It looks like > the ranges for the streams are not created in a way that is topology-aware. > This should probably either be prevented or, ideally, made to work properly. > The use case is migrating several nodes from one rack to another. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
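To make the shift concrete, here is a toy model of rack-aware placement (a deliberate simplification of the NTS idea, not the real NetworkTopologyStrategy code; the node names and ring walk order are assumptions matching the scenario above): walking the ring and preferring racks not yet represented, replacing C1 with a rack-E node pushes the third replica onto whichever other rack comes next.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy rack-aware replica selection: walk the ring from the range's token, prefer
// endpoints in racks not yet represented, and fall back to skipped endpoints only
// if the racks run out. (Simplified; not the real NTS algorithm.)
public class RackAwareReplicas
{
    static final class Node
    {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        public String toString() { return name + "(" + rack + ")"; }
    }

    static List<Node> replicasFor(List<Node> ringOrder, int rf)
    {
        List<Node> replicas = new ArrayList<Node>();
        List<Node> skipped = new ArrayList<Node>();
        Set<String> racksUsed = new LinkedHashSet<String>();
        for (Node n : ringOrder)
        {
            if (replicas.size() == rf)
                break;
            if (racksUsed.add(n.rack))
                replicas.add(n);   // first endpoint seen in a new rack
            else
                skipped.add(n);    // same-rack duplicate, only used as a last resort
        }
        for (Node n : skipped)
        {
            if (replicas.size() == rf)
                break;
            replicas.add(n);
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        Node a1 = new Node("A1", "A"), c1 = new Node("C1", "C"), e1 = new Node("E1", "E");
        Node e2 = new Node("E2", "E"), b1 = new Node("B1", "B");
        // Before: ring walk for R1 visits A1, C1, E1, B1 -> replicas in racks A, C, E
        System.out.println(replicasFor(Arrays.asList(a1, c1, e1, b1), 3));
        // After replace_address of C1 with E2 (same tokens, but rack E): E2 collides
        // with E1's rack, so the third replica shifts to B1 -- the shift described above.
        System.out.println(replicasFor(Arrays.asList(a1, e2, e1, b1), 3));
    }
}
{code}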
[jira] [Created] (CASSANDRA-9244) replace_address is not topology-aware
Rick Branson created CASSANDRA-9244: --- Summary: replace_address is not topology-aware Key: CASSANDRA-9244 URL: https://issues.apache.org/jira/browse/CASSANDRA-9244 Project: Cassandra Issue Type: Bug Environment: 2.0.12 Reporter: Rick Branson Replaced a node with one in another rack (using replace_address) and it caused improper distribution after the bootstrap was finished. It looks like the ranges for the streams are not created in a way that is topology-aware. This should probably either be prevented or, ideally, made to work properly. The use case is migrating several nodes from one rack to another. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-9227) Support num_tokens: 0
[ https://issues.apache.org/jira/browse/CASSANDRA-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson resolved CASSANDRA-9227. - Resolution: Not A Problem We tested join_ring=false after 2.0.7 came out and didn't have any luck. It would join the ring but end up looping. I tested it again and it works perfectly now. Thanks, and sorry for the confusion :) Will probably post a short GIST when we try this in prod. > Support num_tokens: 0 > - > > Key: CASSANDRA-9227 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9227 > Project: Cassandra > Issue Type: Improvement >Reporter: Rick Branson > > num_tokens: 0 would allow serving proxy traffic on non-storage machines. > Token-aware routing does improve the cost of proxying a bit, but it only > works in cases where CL<=1 for reads, and doesn't really improve the write > situation. For some installations, token-aware routing is also impractical, > such as when there are a very large number of clients. > For use cases that are throughput-bound rather than storage-bound, this would > allow running CPU-intensive proxy work on cheaper "diskless" machines. > Proxies could also be installed on the same machine as the client or on > separate "edge" machines that are closer to the client than the storage > nodes. In my observation, anywhere between 25-50% of a node's CPU is spent on > proxy work, depending on the workload. > I say "diskless" because they can't be truly diskless as they would have to > store hints. Hint storage can use cheap spinning disks or small solid-state > drives instead of expensive big disk arrays or big SSDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-9227) Support num_tokens: 0
[ https://issues.apache.org/jira/browse/CASSANDRA-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-9227: Description: num_tokens: 0 would allow serving proxy traffic on non-storage machines. Token-aware routing does improve the cost of proxying a bit, but it only works in cases where CL<=1 for reads, and doesn't really improve the write situation. For some installations, token-aware routing is also impractical, such as when there are a very large number of clients. For use cases that are throughput-bound rather than storage-bound, this would allow running CPU-intensive proxy work on cheaper "diskless" machines. Proxies could also be installed on the same machine as the client or on separate "edge" machines that are closer to the client than the storage nodes. In my observation, anywhere between 25-50% of a node's CPU is spent on proxy work, depending on the workload. I say "diskless" because they can't be truly diskless as they would have to store hints. Hint storage can use cheap spinning disks or small solid-state drives instead of expensive big disk arrays or big SSDs. was: num_tokens: 0 would allow serving proxy traffic on non-storage machines. Token-aware routing does improve the cost of proxying a bit, but it only works in cases where CL>1 for reads, and doesn't really improve the write situation. For some installations, token-aware routing is also impractical, such as when there are a very large number of clients. For use cases that are throughput-bound rather than storage-bound, this would allow running CPU-intensive proxy work on cheaper "diskless" machines. Proxies could also be installed on the same machine as the client or on separate "edge" machines that are closer to the client than the storage nodes. In my observation, anywhere between 25-50% of a node's CPU is spent on proxy work, depending on the workload. I say "diskless" because they can't be truly diskless as they would have to store hints. Hint storage can use cheap spinning disks or small solid-state drives instead of expensive big disk arrays or big SSDs. > Support num_tokens: 0 > - > > Key: CASSANDRA-9227 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9227 > Project: Cassandra > Issue Type: Improvement >Reporter: Rick Branson > > num_tokens: 0 would allow serving proxy traffic on non-storage machines. > Token-aware routing does improve the cost of proxying a bit, but it only > works in cases where CL<=1 for reads, and doesn't really improve the write > situation. For some installations, token-aware routing is also impractical, > such as when there are a very large number of clients. > For use cases that are throughput-bound rather than storage-bound, this would > allow running CPU-intensive proxy work on cheaper "diskless" machines. > Proxies could also be installed on the same machine as the client or on > separate "edge" machines that are closer to the client than the storage > nodes. In my observation, anywhere between 25-50% of a node's CPU is spent on > proxy work, depending on the workload. > I say "diskless" because they can't be truly diskless as they would have to > store hints. Hint storage can use cheap spinning disks or small solid-state > drives instead of expensive big disk arrays or big SSDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-9227) Support num_tokens: 0
Rick Branson created CASSANDRA-9227: --- Summary: Support num_tokens: 0 Key: CASSANDRA-9227 URL: https://issues.apache.org/jira/browse/CASSANDRA-9227 Project: Cassandra Issue Type: Improvement Reporter: Rick Branson num_tokens: 0 would allow serving proxy traffic on non-storage machines. Token-aware routing does improve the cost of proxying a bit, but it only works in cases where CL>1 for reads, and doesn't really improve the write situation. For some installations, token-aware routing is also impractical, such as when there are a very large number of clients. For use cases that are throughput-bound rather than storage-bound, this would allow running CPU-intensive proxy work on cheaper "diskless" machines. Proxies could also be installed on the same machine as the client or on separate "edge" machines that are closer to the client than the storage nodes. In my observation, anywhere between 25-50% of a node's CPU is spent on proxy work, depending on the workload. I say "diskless" because they can't be truly diskless as they would have to store hints. Hint storage can use cheap spinning disks or small solid-state drives instead of expensive big disk arrays or big SSDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
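Rough arithmetic behind the CL<=1 caveat, as a sketch (my own back-of-envelope with RF=3, not numbers from the ticket): token-aware routing only removes network hops when the coordinator happens to be a replica and the consistency level lets it answer mostly on its own, which is why quorum reads and writes still pay most of the proxy cost.

{code}
// Back-of-envelope: remote messages a coordinator sends per request with RF=3,
// assuming token-aware routing makes the coordinator one of the replicas, and
// ignoring read repair and speculative retry. Illustrates why token-aware routing
// mostly helps CL=ONE reads and barely changes writes, which fan out to all replicas.
public class ProxyHops
{
    static int remoteReads(int replicasContacted, boolean coordinatorIsReplica)
    {
        return coordinatorIsReplica ? replicasContacted - 1 : replicasContacted;
    }

    static int remoteWrites(int rf, boolean coordinatorIsReplica)
    {
        return coordinatorIsReplica ? rf - 1 : rf; // every replica still gets the write
    }

    public static void main(String[] args)
    {
        System.out.println("read  CL=ONE    random coordinator: " + remoteReads(1, false));  // 1
        System.out.println("read  CL=ONE    token-aware:        " + remoteReads(1, true));   // 0
        System.out.println("read  CL=QUORUM random coordinator: " + remoteReads(2, false));  // 2
        System.out.println("read  CL=QUORUM token-aware:        " + remoteReads(2, true));   // 1
        System.out.println("write RF=3      random coordinator: " + remoteWrites(3, false)); // 3
        System.out.println("write RF=3      token-aware:        " + remoteWrites(3, true));  // 2
    }
}
{code}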
[jira] [Commented] (CASSANDRA-8835) 100% CPU spikes from disruptor thrift server
[ https://issues.apache.org/jira/browse/CASSANDRA-8835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327993#comment-14327993 ] Rick Branson commented on CASSANDRA-8835: - also seeing this pop to the top simultaneously on some nodes (above selectorIterationComplete): Ljava/util/HashMap$HashIterator;.remove in Lcom/thinkaurelius/thrift/TDisruptorServer$AbstractSelectorThread;.select > 100% CPU spikes from disruptor thrift server > > > Key: CASSANDRA-8835 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8835 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 2.0.11, hsha, JDK7u65, ~6,000 connections per machine >Reporter: Rick Branson > > Seeing big CPU spikes (to 100%) inside of the TDisruptorServer when running > hsha. perf top loaded with JVM symbols shows this one floating to the top for > anywhere between 15 and 90 seconds and then falling down: > Lcom/thinkaurelius/thrift/TDisruptorServer$SelectorThread;.selectorIterationComplete > in Lcom/thinkaurelius/thrift/TDisruptorServer$AbstractSelectorThread;.select -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8835) 100% CPU spikes from disruptor thrift server
Rick Branson created CASSANDRA-8835: --- Summary: 100% CPU spikes from disruptor thrift server Key: CASSANDRA-8835 URL: https://issues.apache.org/jira/browse/CASSANDRA-8835 Project: Cassandra Issue Type: Bug Components: Core Environment: 2.0.11, hsha, JDK7u65, ~6,000 connections per machine Reporter: Rick Branson Seeing big CPU spikes (to 100%) inside of the TDisruptorServer when running hsha. perf top loaded with JVM symbols shows this one floating to the top for anywhere between 15 and 90 seconds and then falling down: Lcom/thinkaurelius/thrift/TDisruptorServer$SelectorThread;.selectorIterationComplete in Lcom/thinkaurelius/thrift/TDisruptorServer$AbstractSelectorThread;.select -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14251110#comment-14251110 ] Rick Branson commented on CASSANDRA-8463: - Code LGTM. > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, > 0001-make-sure-we-set-lastCompactedKey-properly.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
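For reading the level arrays above: each cell is the SSTable count for that level, printed as actual/limit once a level is over its cap. Below is a small sketch that flags the overfull levels; the 10^level cap per level (and the 4 shown for L0) is an assumption about the log format on my part, not something stated in the ticket.

{code}
import java.util.Arrays;

// Parses one "SSTables in each level: [6/4, 10, 105/100, 268, ...]" line and flags
// which levels are over their nominal size -- the levels that keep generating
// compaction tasks. Caps assumed: 4 for L0, 10^level otherwise.
public class LevelFill
{
    public static void main(String[] args)
    {
        String line = "[6/4, 10, 105/100, 268, 0, 0, 0, 0, 0]";
        String[] cells = line.replaceAll("[\\[\\]\\s]", "").split(",");
        for (int level = 0; level < cells.length; level++)
        {
            // Cells are either "count" or "count/limit"; take the count.
            int count = Integer.parseInt(cells[level].split("/")[0]);
            long limit = level == 0 ? 4 : (long) Math.pow(10, level);
            String flag = count > limit ? "  <-- over, will keep compacting" : "";
            System.out.printf("L%d: %d/%d%s%n", level, count, limit, flag);
        }
    }
}
{code}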
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249047#comment-14249047 ] Rick Branson commented on CASSANDRA-8463: - It did take some time to rebalance the tables within the levels, but nodes are able to clear their compaction log with the patch applied. +1 > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, > 0001-make-sure-we-set-lastCompactedKey-properly.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248518#comment-14248518 ] Rick Branson commented on CASSANDRA-8463: - I applied that patch to one node. What should I be looking for? Is there any persisted state for that data? Shortly after restarting the daemon it did another one of those single table L1 compactions into L2. After finishing that, it immediately compacted that single new L2 file into 452 L3 tables. That pattern is what's causing the problem. > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, > 0001-make-sure-we-set-lastCompactedKey-properly.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8462) Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes
[ https://issues.apache.org/jira/browse/CASSANDRA-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247499#comment-14247499 ] Rick Branson commented on CASSANDRA-8462: - [~iamaleksey] this happened first time I restarted a 2.0 node after adding a 2.1 node to the cluster. It happened every time I restarted a 2.0 node, until I upgraded it to 2.1. > Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes > - > > Key: CASSANDRA-8462 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8462 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > > Added a 2.1.2 node to a cluster running 2.0.11. Didn't make any schema > changes. When I tried to reboot one of the 2.0 nodes, it failed to boot with > this exception. Besides an obvious fix, any workarounds for this? > {noformat} > java.lang.IllegalArgumentException: No enum constant > org.apache.cassandra.config.CFMetaData.Caching.{"keys":"ALL", > "rows_per_partition":"NONE"} > at java.lang.Enum.valueOf(Enum.java:236) > at > org.apache.cassandra.config.CFMetaData$Caching.valueOf(CFMetaData.java:286) > at > org.apache.cassandra.config.CFMetaData.fromSchemaNoColumnsNoTriggers(CFMetaData.java:1713) > at > org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1793) > at > org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:307) > at > org.apache.cassandra.config.KSMetaData.fromSchema(KSMetaData.java:288) > at > org.apache.cassandra.db.DefsTables.loadFromKeyspace(DefsTables.java:131) > at > org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:529) > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247337#comment-14247337 ] Rick Branson commented on CASSANDRA-8463: - This also seems suspect for creating overlap (single-table L1 compaction output placed into L2??): 2014-12-15_22:17:28.58953 INFO 22:17:28 Compacting (27d85a60-84a8-11e4-8b9d-79ee1dbb2761) [SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_001/ks1-DataByUserID_001-ka-42169-Data.db, level=1), ] 2014-12-15_22:17:57.84703 INFO 22:17:57 Compacted (27d85a60-84a8-11e4-8b9d-79ee1dbb2761) 1 sstables to [/data/cassandra/data/ks1/DataByUserID_001/ks1-DataByUserID_001-ka-42835,] to level=2. 33,581,350 bytes to 33,581,350 (~100% of original) in 29,256ms = 1.094670MB/s. 80,322 total partitions merged to 80,322. Partition merge counts were {1:80322, } > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247293#comment-14247293 ] Rick Branson commented on CASSANDRA-8463: - I applied the logging patch on one node and quickly noticed a suspect compaction. This contains 130 sstables from L3 and only 1 from L2. 2014-12-15_21:50:10.24752 INFO 21:50:10 Compacting (5750bed0-84a4-11e4-8177-4571ae47521c) [SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35274-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35305-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35330-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35409-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35512-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35286-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35323-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35366-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35407-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35558-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35206-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35570-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35247-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35361-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35335-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35261-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35588-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35368-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35545-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35311-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35231-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35416-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35567-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35265-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35500-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35516-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35304-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35358-Data.db, level=3), 
SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35256-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35550-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35393-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35533-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35494-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35538-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35239-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35544-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35401-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35502-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35495-Data.db, level=3), SSTableReader(path=/data/cassandra/data/ks1/DataByUserID_005/ks1-DataByUserID_005-ka-35553-Data.db, level=3), SSTableReader(path=/da
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247061#comment-14247061 ] Rick Branson commented on CASSANDRA-8463: - Writes had to be turned off to this cluster to avoid some errors we began to see. I started to see some SSTables appear in L0 w/o any flush activity. Could this be a clue? > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246905#comment-14246905 ] Rick Branson commented on CASSANDRA-8463: - It looks like this is a regression in 2.1 itself, not just an issue with upgrading from 2.0 to 2.1. I emptied a node's SSTables and after ~12 hours it was back into the backlogged compaction state. I tried pushing the memtable size up very high, to where it was flushing ~100MB SSTables. This allowed the backlog to clear while there were only L0-L2 tables, but as soon as L3 tables appeared, it began to constantly compact again. After examining diffs between the 2.0 and 2.1 code, I also tried setting the tombstone compaction threshold to 1.0 to no avail. I'll see if I can get you some better logs with that patch, Marcus. > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens.
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > i
[jira] [Updated] (CASSANDRA-8463) Constant compaction under LCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-8463: Summary: Constant compaction under LCS (was: Upgrading 2.0 to 2.1 causes LCS to recompact all files) > Constant compaction under LCS > - > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: 0001-better-logging.patch, log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8475) Altering Table's tombstone_threshold stalls compaction until restart
[ https://issues.apache.org/jira/browse/CASSANDRA-8475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244887#comment-14244887 ] Rick Branson commented on CASSANDRA-8475: - Workaround for this is to disable & stop all compactions before making the schema change, then re-enable compactions after the schema change is made. > Altering Table's tombstone_threshold stalls compaction until restart > > > Key: CASSANDRA-8475 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8475 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson > > Compaction won't move forward on the table until a restart takes place and > the temp table is ignored. My hunch is that running CompactionTasks are > killed and there are still pre-opened temp files ref'd but they get deleted > with the CompactionTask dies? > Exception: > 2014-12-12_22:03:19.84572 ERROR 22:03:19 Exception in thread > Thread[CompactionExecutor:671,1,main] > 2014-12-12_22:03:19.84576 java.lang.RuntimeException: > java.io.FileNotFoundException: > /data/cassandra/data/ks1/DataByUserID_007/ks1-DataByUserID_007-tmplink-ka-21801-Data.db > (No such file or directory) > 2014-12-12_22:03:19.84576 at > org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84577 at > org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1895) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84578 at > org.apache.cassandra.io.sstable.SSTableScanner.(SSTableScanner.java:67) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84579 at > org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1681) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84579 at > org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1693) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84580 at > org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getScanners(LeveledCompactionStrategy.java:181) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84581 at > org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getScanners(WrappingCompactionStrategy.java:320) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84581 at > org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:340) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84582 at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:151) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84583 at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84583 at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84583 at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84584 at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84584 at > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:232) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84585 at > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > ~[na:1.7.0_65] > 2014-12-12_22:03:19.84586 at > java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_65] > 2014-12-12_22:03:19.84586 at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > ~[na:1.7.0_65] > 2014-12-12_22:03:19.84587 at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > [na:1.7.0_65] > 2014-12-12_22:03:19.84587 at java.lang.Thread.run(Thread.java:745) > [na:1.7.0_65] > 2014-12-12_22:03:19.84587 Caused by: java.io.FileNotFoundException: > /data/cassandra/data/ks1/DataByUserID_007/ks1-DataByUserID_007-tmplink-ka-21801-Data.db > (No such file or directory) > 2014-12-12_22:03:19.84588 at java.io.RandomAccessFile.open(Native > Method) ~[na:1.7.0_65] > 2014-12-12_22:03:19.84588 at > java.io.RandomAccessFile.(RandomAccessFile.java:241) ~[na:1.7.0_65] > 2014-12-12_22:03:19.84589 at > org.apache.cassandra.io.util.RandomAccessReader.(RandomAccessReader.java:58) > ~[apache-cassandra-2.1.2.jar:2.1.2] > 2014-12-12_22:03:19.84590 at > org.apache.cassandra.io.compress
[jira] [Created] (CASSANDRA-8475) Altering Table's tombstone_threshold stalls compaction until restart
Rick Branson created CASSANDRA-8475: --- Summary: Altering Table's tombstone_threshold stalls compaction until restart Key: CASSANDRA-8475 URL: https://issues.apache.org/jira/browse/CASSANDRA-8475 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Compaction won't move forward on the table until a restart takes place and the temp table is ignored. My hunch is that running CompactionTasks are killed and there are still pre-opened temp files ref'd but they get deleted with the CompactionTask dies? Exception: 2014-12-12_22:03:19.84572 ERROR 22:03:19 Exception in thread Thread[CompactionExecutor:671,1,main] 2014-12-12_22:03:19.84576 java.lang.RuntimeException: java.io.FileNotFoundException: /data/cassandra/data/ks1/DataByUserID_007/ks1-DataByUserID_007-tmplink-ka-21801-Data.db (No such file or directory) 2014-12-12_22:03:19.84576 at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:52) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84577 at org.apache.cassandra.io.sstable.SSTableReader.openDataReader(SSTableReader.java:1895) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84578 at org.apache.cassandra.io.sstable.SSTableScanner.(SSTableScanner.java:67) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84579 at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1681) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84579 at org.apache.cassandra.io.sstable.SSTableReader.getScanner(SSTableReader.java:1693) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84580 at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getScanners(LeveledCompactionStrategy.java:181) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84581 at org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getScanners(WrappingCompactionStrategy.java:320) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84581 at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.getScanners(AbstractCompactionStrategy.java:340) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84582 at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:151) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84583 at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84583 at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84583 at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84584 at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84584 at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:232) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84585 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_65] 2014-12-12_22:03:19.84586 at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_65] 2014-12-12_22:03:19.84586 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_65] 2014-12-12_22:03:19.84587 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65] 2014-12-12_22:03:19.84587 at java.lang.Thread.run(Thread.java:745) 
[na:1.7.0_65] 2014-12-12_22:03:19.84587 Caused by: java.io.FileNotFoundException: /data/cassandra/data/ks1/DataByUserID_007/ks1-DataByUserID_007-tmplink-ka-21801-Data.db (No such file or directory) 2014-12-12_22:03:19.84588 at java.io.RandomAccessFile.open(Native Method) ~[na:1.7.0_65] 2014-12-12_22:03:19.84588 at java.io.RandomAccessFile.(RandomAccessFile.java:241) ~[na:1.7.0_65] 2014-12-12_22:03:19.84589 at org.apache.cassandra.io.util.RandomAccessReader.(RandomAccessReader.java:58) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84590 at org.apache.cassandra.io.compress.CompressedRandomAccessReader.(CompressedRandomAccessReader.java:77) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84590 at org.apache.cassandra.io.compress.CompressedThrottledReader.(CompressedThrottledReader.java:34) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84591 at org.apache.cassandra.io.compress.CompressedThrottledReader.open(CompressedThrottledReader.java:48) ~[apache-cassandra-2.1.2.jar:2.1.2] 2014-12-12_22:03:19.84591 ... 18 common frames omitted -- This message was sent by Atlassian JI
[jira] [Updated] (CASSANDRA-8463) Upgrading 2.0 to 2.1 causes LCS to recompact all files
[ https://issues.apache.org/jira/browse/CASSANDRA-8463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-8463: Attachment: log-for-8463.txt > Upgrading 2.0 to 2.1 causes LCS to recompact all files > -- > > Key: CASSANDRA-8463 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), > 144G RAM, solid-state storage. > Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. > Heap is 32G total, 4G newsize. > 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 > memtable_cleanup_threshold > concurrent_compactors: 20 >Reporter: Rick Branson >Assignee: Marcus Eriksson > Fix For: 2.1.3 > > Attachments: log-for-8463.txt > > > It appears that tables configured with LCS will completely re-compact > themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 > -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or > so, then starts building up, now with 50-100 tasks pending across the cluster > after 12 hours. These nodes are under heavy write load, but were easily able > to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't > think it's LCS in 2.1 actually being worse, just perhaps some different LCS > behavior that causes the layout of tables from 2.0 to prompt the compactor to > reorganize them? > The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB > SSTables due to the improved memtable setup in 2.1. Before I upgraded the > entire cluster to 2.1, I noticed the problem and tried several variations on > the flush size, thinking perhaps the larger tables in L0 were causing some > kind of cascading compactions. Even if they're sized roughly like the 2.0 > flushes were, same behavior occurs. I also tried both enabling & disabling > STCS in L0 with no real change other than L0 began to back up faster, so I > left the STCS in L0 enabled. > Tables are configured with 32MB sstable_size_in_mb, which was found to be an > improvement on the 160MB table size for compaction performance. Maybe this is > wrong now? Otherwise, the tables are configured with defaults. Compaction has > been unthrottled to help them catch-up. The compaction threads stay very > busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely > caught up yet. I'll update JIRA with status about their progress if anything > interesting happens. 
> From a node around 12 hours ago, around an hour after the upgrade, with 19 > pending compaction tasks: > SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] > SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] > SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] > SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] > Recently, with 41 pending compaction tasks: > SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] > SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] > SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] > SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] > SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] > More information about the use case: writes are roughly uniform across these > tables. The data is "sharded" across these 8 tables by key to improve > compaction parallelism. Each node receives up to 75,000 writes/sec sustained > at peak, and a small number of reads. This is a pre-production cluster that's > being warmed up with new data, so the low volume of reads (~100/sec per node) > is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8463) Upgrading 2.0 to 2.1 causes LCS to recompact all files
Rick Branson created CASSANDRA-8463: --- Summary: Upgrading 2.0 to 2.1 causes LCS to recompact all files Key: CASSANDRA-8463 URL: https://issues.apache.org/jira/browse/CASSANDRA-8463 Project: Cassandra Issue Type: Bug Components: Core Environment: Hardware is recent 2-socket, 16-core (x2 Hyperthreaded), 144G RAM, solid-state storage. Platform is Linux 3.2.51, Oracle JDK 64-bit 1.7.0_65. Heap is 32G total, 4G newsize. 8G/8G on-heap/off-heap memtables, offheap_buffer allocator, 0.5 memtable_cleanup_threshold concurrent_compactors: 20 Reporter: Rick Branson It appears that tables configured with LCS will completely re-compact themselves over some period of time after upgrading from 2.0 to 2.1 (2.0.11 -> 2.1.2, specifically). It starts out with <10 pending tasks for an hour or so, then starts building up, now with 50-100 tasks pending across the cluster after 12 hours. These nodes are under heavy write load, but were easily able to keep up in 2.0 (they rarely had >5 pending compaction tasks), so I don't think it's LCS in 2.1 actually being worse, just perhaps some different LCS behavior that causes the layout of tables from 2.0 to prompt the compactor to reorganize them? The nodes flushed ~11MB SSTables under 2.0. They're currently flushing ~36MB SSTables due to the improved memtable setup in 2.1. Before I upgraded the entire cluster to 2.1, I noticed the problem and tried several variations on the flush size, thinking perhaps the larger tables in L0 were causing some kind of cascading compactions. Even if they're sized roughly like the 2.0 flushes were, same behavior occurs. I also tried both enabling & disabling STCS in L0 with no real change other than L0 began to back up faster, so I left the STCS in L0 enabled. Tables are configured with 32MB sstable_size_in_mb, which was found to be an improvement on the 160MB table size for compaction performance. Maybe this is wrong now? Otherwise, the tables are configured with defaults. Compaction has been unthrottled to help them catch-up. The compaction threads stay very busy, with the cluster-wide CPU at 45% "nice" time. No nodes have completely caught up yet. I'll update JIRA with status about their progress if anything interesting happens. >From a node around 12 hours ago, around an hour after the upgrade, with 19 >pending compaction tasks: SSTables in each level: [6/4, 10, 105/100, 268, 0, 0, 0, 0, 0] SSTables in each level: [6/4, 10, 106/100, 271, 0, 0, 0, 0, 0] SSTables in each level: [1, 16/10, 105/100, 269, 0, 0, 0, 0, 0] SSTables in each level: [5/4, 10, 103/100, 272, 0, 0, 0, 0, 0] SSTables in each level: [4, 11/10, 105/100, 270, 0, 0, 0, 0, 0] SSTables in each level: [1, 12/10, 105/100, 271, 0, 0, 0, 0, 0] SSTables in each level: [1, 14/10, 104/100, 267, 0, 0, 0, 0, 0] SSTables in each level: [9/4, 10, 103/100, 265, 0, 0, 0, 0, 0] Recently, with 41 pending compaction tasks: SSTables in each level: [4, 13/10, 106/100, 269, 0, 0, 0, 0, 0] SSTables in each level: [4, 12/10, 106/100, 273, 0, 0, 0, 0, 0] SSTables in each level: [5/4, 11/10, 106/100, 271, 0, 0, 0, 0, 0] SSTables in each level: [4, 12/10, 103/100, 275, 0, 0, 0, 0, 0] SSTables in each level: [2, 13/10, 106/100, 273, 0, 0, 0, 0, 0] SSTables in each level: [3, 10, 104/100, 275, 0, 0, 0, 0, 0] SSTables in each level: [6/4, 11/10, 103/100, 269, 0, 0, 0, 0, 0] SSTables in each level: [4, 16/10, 105/100, 264, 0, 0, 0, 0, 0] More information about the use case: writes are roughly uniform across these tables. 
The data is "sharded" across these 8 tables by key to improve compaction parallelism. Each node receives up to 75,000 writes/sec sustained at peak, and a small number of reads. This is a pre-production cluster that's being warmed up with new data, so the low volume of reads (~100/sec per node) is just from automatic sampled data checks, otherwise we'd just use STCS :) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8462) Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes
Rick Branson created CASSANDRA-8462: --- Summary: Upgrading a 2.0 to 2.1 breaks CFMetaData on 2.0 nodes Key: CASSANDRA-8462 URL: https://issues.apache.org/jira/browse/CASSANDRA-8462 Project: Cassandra Issue Type: Bug Components: Core Reporter: Rick Branson Added a 2.1.2 node to a cluster running 2.0.11. Didn't make any schema changes. When I tried to reboot one of the 2.0 nodes, it failed to boot with this exception. Besides an obvious fix, any workarounds for this? {code} java.lang.IllegalArgumentException: No enum constant org.apache.cassandra.config.CFMetaData.Caching.{"keys":"ALL", "rows_per_partition":"NONE"} at java.lang.Enum.valueOf(Enum.java:236) at org.apache.cassandra.config.CFMetaData$Caching.valueOf(CFMetaData.java:286) at org.apache.cassandra.config.CFMetaData.fromSchemaNoColumnsNoTriggers(CFMetaData.java:1713) at org.apache.cassandra.config.CFMetaData.fromSchema(CFMetaData.java:1793) at org.apache.cassandra.config.KSMetaData.deserializeColumnFamilies(KSMetaData.java:307) at org.apache.cassandra.config.KSMetaData.fromSchema(KSMetaData.java:288) at org.apache.cassandra.db.DefsTables.loadFromKeyspace(DefsTables.java:131) at org.apache.cassandra.config.DatabaseDescriptor.loadSchemas(DatabaseDescriptor.java:529) at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:270) at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496) at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
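A minimal standalone sketch of why the 2.0 node fails to boot: Enum.valueOf cannot parse the JSON-style caching value that 2.1 writes into the schema tables. The enum constants below are assumptions chosen for illustration, not the actual CFMetaData.Caching source.
{code}
// Illustrative only: a standalone enum mimicking the pre-2.1 style of caching
// values (constant names here are an assumption, not copied from CFMetaData).
enum Caching { ALL, KEYS_ONLY, ROWS_ONLY, NONE }

public class CachingValueDemo
{
    public static void main(String[] args)
    {
        // A 2.0-style value parses fine:
        System.out.println(Caching.valueOf("KEYS_ONLY"));

        // 2.1 stores caching as a JSON map, so Enum.valueOf fails with the same
        // kind of IllegalArgumentException quoted above:
        String fromNewSchema = "{\"keys\":\"ALL\", \"rows_per_partition\":\"NONE\"}";
        Caching.valueOf(fromNewSchema); // throws IllegalArgumentException
    }
}
{code}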
[jira] [Commented] (CASSANDRA-7246) Gossip Null Pointer Exception when a cassandra instance in ring is restarted
[ https://issues.apache.org/jira/browse/CASSANDRA-7246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095965#comment-14095965 ] Rick Branson commented on CASSANDRA-7246: - Got some similar NPEs on restart on a node. This resulted in this node having a broken gossip table in memory (ugh) that didn't get fixed until a restart. This host was running 1.2.18. Presumably this is getEndpointStateForEndpoint returning null inside of getHostId. {code} java.lang.NullPointerException at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:698) at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1521) at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1341) at org.apache.cassandra.gms.Gossiper.doNotifications(Gossiper.java:975) at org.apache.cassandra.gms.Gossiper.applyNewStates(Gossiper.java:966) at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:924) at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} {code} java.lang.NullPointerException at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:698) at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1521) at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1341) at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2033) at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:863) at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:914) at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} > Gossip Null Pointer Exception when a cassandra instance in ring is restarted > > > Key: CASSANDRA-7246 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7246 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 12 node ring of 1.2.x. > 11 of 12 are 1.2.15. > 1 is 1.2.16. >Reporter: Peter Haggerty >Assignee: Brandon Williams >Priority: Minor > Labels: gossip, nullpointerexception > Attachments: 7246.txt > > > 12 Cassandra instances, one per node. > 11 of the Cassandra instances are 1.2.15. > 1 of the Cassandra instances is 1.2.16. > One of the eleven 1.2.15 Cassandra instances is restarted (disable thrift, > gossip, then flush, drain, stop, start). > The 1.2.16 Cassandra instance noted this by throwing a Null Pointer > Exception. None of the 1.2.15 instances threw an exception and this is new > behavior that hasn't been observed before. 
> ERROR 02:18:06,009 Exception in thread Thread[GossipStage:1,5,main] > java.lang.NullPointerException > at org.apache.cassandra.gms.Gossiper.convict(Gossiper.java:264) > at > org.apache.cassandra.gms.FailureDetector.forceConviction(FailureDetector.java:246) > at > org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:37) > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > INFO 02:18:23,402 Node /10.x.y.x is now part of the cluster > INFO 02:18:23,403 InetAddress /10.x.y.z is now UP > INFO 02:18:53,494 FatClient /10.x.y.z has been silent for 3ms, removing > from gossip > INFO 02:19:00,031 Handshaking version with /10.x.y.z -- This message was sent by Atlassian JIRA (v6.2#6252)
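A hypothetical sketch of the failure mode described in the comment above: if the endpoint has no gossip state yet (or it was just removed), the state lookup returns null and the host-ID lookup dereferences it. The class and field names are invented for illustration, not the real Gossiper code.
{code}
import java.net.InetAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

class EndpointState { UUID hostId; }

class GossiperSketch
{
    private final Map<InetAddress, EndpointState> endpointStateMap = new ConcurrentHashMap<>();

    UUID getHostId(InetAddress endpoint)
    {
        EndpointState state = endpointStateMap.get(endpoint);
        if (state == null)
            // Dereferencing a missing state unconditionally is where an NPE like
            // the one at Gossiper.getHostId would come from; failing with a
            // descriptive error (or skipping the endpoint) avoids the crash.
            throw new IllegalStateException("No gossip state for " + endpoint);
        return state.hostId;
    }
}
{code}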
[jira] [Created] (CASSANDRA-7764) RFC: Range movements will "wake up" previously invisible data
Rick Branson created CASSANDRA-7764: --- Summary: RFC: Range movements will "wake up" previously invisible data Key: CASSANDRA-7764 URL: https://issues.apache.org/jira/browse/CASSANDRA-7764 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Presumably this has been going on as long as Cassandra has existed, but wanted to capture it here since it came up in an IRC discussion. This issue will probably show up on any cluster eventually. Scenario: 1) Start with a 3-node cluster, RF=1 2) A 4th node is added to the cluster 3) Data is deleted on ranges belonging to 4th node 4) Wait for GC to clean up some tombstones on 4th node 5) 4th node removed from cluster 6) Deleted data will reappear since it was dormant on the original 3 nodes This could definitely happen in many other situations where dormant data could exist such as inconsistencies that aren't resolved before range movement, but the case above seemed the most reasonable to propose as a real-world problem. The cleanup operation can be used to get rid of the dormant data, but from my experience people don't run cleanup unless they're low on disk. Relying on that is definitely not a best practice for data integrity. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7758) Some gossip messages are very slow to process on vnode clusters
[ https://issues.apache.org/jira/browse/CASSANDRA-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-7758: Attachment: 7758-1-v2.0.txt > Some gossip messages are very slow to process on vnode clusters > --- > > Key: CASSANDRA-7758 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7758 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Rick Branson > Attachments: 7758-1-v1.2.txt, 7758-1-v2.0.txt > > > Both StorageService.getChangedRangesForLeaving and > PendingRangeCalculatorService.calculatePendingRanges clone the TokenMetadata > map once per range that a node covers. This is a very expensive operation on > a cluster with vnodes since the TokenMetadata map can easily be tens of > thousands of entries, even for a small cluster, and this runs thousands of > times for each changing node. This can cause big cluster-wide slowdowns & > problems in the GossipStage and PendingRangeCalculator thread pools. > Processing just one message can generate enough garbage to pause the entire > cluster simultaneously when one of these is being propagated. Patches > forthcoming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7758) Some gossip messages are very slow to process on vnode clusters
[ https://issues.apache.org/jira/browse/CASSANDRA-7758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-7758: Attachment: 7758-1-v1.2.txt > Some gossip messages are very slow to process on vnode clusters > --- > > Key: CASSANDRA-7758 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7758 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Rick Branson > Attachments: 7758-1-v1.2.txt > > > Both StorageService.getChangedRangesForLeaving and > PendingRangeCalculatorService.calculatePendingRanges clone the TokenMetadata > map once per range that a node covers. This is a very expensive operation on > a cluster with vnodes since the TokenMetadata map can easily be tens of > thousands of entries, even for a small cluster, and this runs thousands of > times for each changing node. This can cause big cluster-wide slowdowns & > problems in the GossipStage and PendingRangeCalculator thread pools. > Processing just one message can generate enough garbage to pause the entire > cluster simultaneously when one of these is being propagated. Patches > forthcoming. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7758) Some gossip messages are very slow to process on vnode clusters
Rick Branson created CASSANDRA-7758: --- Summary: Some gossip messages are very slow to process on vnode clusters Key: CASSANDRA-7758 URL: https://issues.apache.org/jira/browse/CASSANDRA-7758 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Rick Branson Both StorageService.getChangedRangesForLeaving and PendingRangeCalculatorService.calculatePendingRanges clone the TokenMetadata map once per range that a node covers. This is a very expensive operation on a cluster with vnodes since the TokenMetadata map can easily be tens of thousands of entries, even for a small cluster, and this runs thousands of times for each changing node. This can cause big cluster-wide slowdowns & problems in the GossipStage and PendingRangeCalculator thread pools. Processing just one message can generate enough garbage to pause the entire cluster simultaneously when one of these is being propagated. Patches forthcoming. -- This message was sent by Atlassian JIRA (v6.2#6252)
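A simplified sketch of the optimization this ticket describes (names and interfaces are assumptions, not the attached patch): hoist the expensive token-map clone out of the per-range loop and reuse one snapshot per gossip event.
{code}
import java.util.Collection;

class PendingRangeSketch
{
    interface TokenMetadataLike { TokenMetadataLike cloneOnlyTokenMap(); }
    interface RangeLike { void computePendingFor(TokenMetadataLike snapshot); }

    // Before: one expensive clone per range a node covers, which on a vnode
    // cluster means thousands of clones and a flood of garbage per message.
    static void slow(TokenMetadataLike tm, Collection<RangeLike> ranges)
    {
        for (RangeLike r : ranges)
            r.computePendingFor(tm.cloneOnlyTokenMap());
    }

    // After: clone once and share the snapshot across all ranges for this
    // calculation, so the cost no longer scales with the vnode count.
    static void fast(TokenMetadataLike tm, Collection<RangeLike> ranges)
    {
        TokenMetadataLike snapshot = tm.cloneOnlyTokenMap();
        for (RangeLike r : ranges)
            r.computePendingFor(snapshot);
    }
}
{code}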
[jira] [Commented] (CASSANDRA-6335) Hints broken for nodes that change broadcast address
[ https://issues.apache.org/jira/browse/CASSANDRA-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091142#comment-14091142 ] Rick Branson commented on CASSANDRA-6335: - Easiest way to repro might be to terminate a bootstrap before it finishes: {code} INFO [GossipTasks:1] 2014-08-05 16:56:41,331 Gossiper.java (line 658) FatClient /10.210.165.35 has been silent for 3ms, removing from gossip ERROR [MutationStage:1] 2014-08-05 16:56:41,341 CassandraDaemon.java (line 199) Exception in thread Thread[MutationStage:1,5,main] "java.lang.AssertionError: Missing host ID for "10.210.165.35 at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:898) at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:869) at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1987) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} > Hints broken for nodes that change broadcast address > > > Key: CASSANDRA-6335 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6335 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson > > When a node changes it's broadcast address, the transition process works > properly, but hints that are destined for it can't be delivered because of > the address change. It produces an exception: > java.lang.AssertionError: Missing host ID for 10.1.60.22 > at > org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:598) > at > org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:567) > at > org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1679) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (CASSANDRA-7664) IndexOutOfBoundsException thrown during repair
[ https://issues.apache.org/jira/browse/CASSANDRA-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-7664: Attachment: other-errors.txt Got this on a bootstrapping node shortly after a stream failed. See other-errors.txt file for the relevant log information. > IndexOutOfBoundsException thrown during repair > -- > > Key: CASSANDRA-7664 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7664 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: RHEL 6.1 > Casandra 1.2.3 - 1.2.18 >Reporter: ZhongYu > Attachments: $O[TOX~GGUZRW~IHPYPEG{0.jpg, other-errors.txt > > > I was running repair command with moderate read and write load at the same > time. And I found tens of IndexOutOfBoundsException in system log as follows: > {quote} > ERROR [Thread-6056] 2013-05-22 14:47:59,416 CassandraDaemon.java (line132) > Exception in thread Thread[Thread-6056,5,main] > java.lang.IndexOutOfBoundsException > at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:75) > at > org.apache.cassandra.streaming.compress.CompressedInputStream$Reader.runMayThrow(CompressedInputStream.java:151) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at java.lang.Thread.run(Thread.java:662) > {quote} > I read the source code of CompressedInputStream.java and found there surely > will throw IndexOutOfBoundsException in the following situation: > {code:title=CompressedInputStream.java|borderStyle=solid} > // Part of CompressedInputStream.java start from Line 139 > protected void runMayThrow() throws Exception > { > byte[] compressedWithCRC; > while (chunks.hasNext()) > { > CompressionMetadata.Chunk chunk = chunks.next(); > int readLength = chunk.length + 4; // read with CRC > compressedWithCRC = new byte[readLength]; > int bufferRead = 0; > while (bufferRead < readLength) > bufferRead += source.read(compressedWithCRC, bufferRead, > readLength - bufferRead); > dataBuffer.put(compressedWithCRC); > } > } > {code} > If read function read nothing because the end of the stream has been reached, > it will return -1, thus bufferRead can be negetive. In the next circle, read > function will throw IndexOutOfBoundsException because bufferRead is negetive. -- This message was sent by Atlassian JIRA (v6.2#6252)
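One way the read loop quoted in that report could be hardened (a sketch under the reporter's analysis, not necessarily the fix that was committed): treat a -1 return from read() as an early end-of-stream error instead of letting it drive the offset negative.
{code}
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class ChunkReader
{
    // Reads exactly 'readLength' bytes, failing fast if the stream ends early
    // rather than adding -1 to bufferRead and corrupting the next read() call.
    static byte[] readFully(InputStream source, int readLength) throws IOException
    {
        byte[] compressedWithCRC = new byte[readLength];
        int bufferRead = 0;
        while (bufferRead < readLength)
        {
            int r = source.read(compressedWithCRC, bufferRead, readLength - bufferRead);
            if (r < 0)
                throw new EOFException("Stream ended after " + bufferRead + " of " + readLength + " bytes");
            bufferRead += r;
        }
        return compressedWithCRC;
    }
}
{code}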
[jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
[ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-7704: Attachment: other-errors.txt There wasn't anything in the logs that indicated *why* the failure happened. I attached anything suspect, though. The IndexOutOfBoundsException occurred on the bootstrapping node *after* the stream failure occurred on the node that was streaming out. There was a CompactionTask that ran at 2014-08-05 18:00:25,804 (4 minutes before the StreamOut task) that tried to compact the SSTable referenced in the FileNotFoundException. No other log messages related to that file though. > FileNotFoundException during STREAM-OUT triggers 100% CPU usage > --- > > Key: CASSANDRA-7704 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Benedict > Attachments: 7704.txt, backtrace.txt, other-errors.txt > > > See attached backtrace which was what triggered this. This stream failed and > then ~12 seconds later it emitted that exception. At that point, all CPUs > went to 100%. A thread dump shows all the ReadStage threads stuck inside > IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
Rick Branson created CASSANDRA-7704: --- Summary: FileNotFoundException during STREAM-OUT triggers 100% CPU usage Key: CASSANDRA-7704 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Attachments: backtrace.txt See attached backtrace which was what triggered this. This stream failed and then ~12 seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced(). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7432) Add new CMS GC flags to cassandra_env.sh for JVM later than 1.7.0_60
[ https://issues.apache.org/jira/browse/CASSANDRA-7432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084927#comment-14084927 ] Rick Branson commented on CASSANDRA-7432: - Very happy with this in production. Enabling -XX:+CMSParallelInitialMarkEnabled and -XX:+CMSEdenChunksRecordAlways has *drastically* improved the STW portions of CMS for our nodes with 4GB+ young gen. Not seeing any downsides to enabling these options either. We're also running a few more options that have trade-offs which could be considered: -XX:+CMSScavengeBeforeRemark which has helped reduce the pause times of STW portions of CMS remark with large eden spaces, since it's dependent on how much live garbage is in the eden space. If CMS is ran infrequently (every few minutes or so) I can't see how this option could hurt people even with small eden spaces. -XX:CMSWaitDuration=1. This is how long the CMS will wait for a young gen collection before running the initial mark. It waits for a collection because the stop-the-world initial mark time depends on how many live objects are in eden space. The default is 2000ms, which seems low to me. The trade-off is that if it waits too long, CMS could run out of time and a promotion failure could occur. This seems unlikely though at 10 seconds in any even remotely healthy scenario. > Add new CMS GC flags to cassandra_env.sh for JVM later than 1.7.0_60 > > > Key: CASSANDRA-7432 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7432 > Project: Cassandra > Issue Type: Improvement > Components: Packaging >Reporter: graham sanderson >Assignee: Brandon Williams > Fix For: 1.2.19 > > Attachments: 7432.txt > > > The new flags in question are as follows: > {code} > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > {code} > Given we already have > {code} > JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" > JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" > JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" > JVM_OPTS="$JVM_OPTS -XX:+UseTLAB" > if [ "$JVM_ARCH" = "64-Bit" ] ; then > JVM_OPTS="$JVM_OPTS -XX:+UseCondCardMark" > fi > {code} > The assumption would be that people are at least running on large number CPU > cores/threads > I would therefore recommend defaulting these flags if available - the only > two possible downsides for {{+CMSEdenChunksRecordAlways}}: > 1) There is a new very short (probably un-contended) lock in the "slow" (non > TLAB) eden allocation path with {{+CMSEdenChunksRecordAlways}}. I haven't > detected this timing wise - this is the "slow" path after all > 2) If you are running with {{-XX:-UseCMSCompactAtFullCollection}} (not the > default) *and* you call {{System.gc()}} then {{+CMSEdenChunksRecordAlways}} > will expose you to a possible seg fault: (see > [http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8021809]) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-5239) Asynchronous (non-blocking) StorageProxy
[ https://issues.apache.org/jira/browse/CASSANDRA-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072265#comment-14072265 ] Rick Branson commented on CASSANDRA-5239: - [~benedict]: Not sure I follow how those are related? This ticket is important for operations that need to coordinate with other nodes over the network. It might be worth just focusing this on the native protocol since Thrift is basically deprecated. The CQL3 execution process is totally synchronous starting at the Netty worker thread. I know there's been some talk about turning into a pipeline where results stream through incrementally. This is obviously non-trivial, but async execution seems like it would fit naturally into a refactor of CQL3 to stream results. > Asynchronous (non-blocking) StorageProxy > > > Key: CASSANDRA-5239 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5239 > Project: Cassandra > Issue Type: Improvement > Components: Core >Affects Versions: 2.0 beta 1 >Reporter: Vijay >Assignee: Sylvain Lebresne > Labels: performance > Fix For: 3.0 > > > Problem Statement: > Currently we have "rpc_min_threads, rpc_max_threads"/ > "native_transport_min_threads/native_transport_max_threads" all of the > threads in the TPE are blocking and takes resources, the threads are mostly > sleeping. Increasing the Context switch costs. > Details: > We should change StorageProxy methods to provide a callback which contains > the location where the results has to be written. When the response arrive > StorageProxy callback can write the results directly into the connection. > Timeouts can be handled in the same way. > Fixing Netty should be trivial with some refactor in the storage proxy > (currently it is one method call for sending the request and waiting) we need > callback. > Fixing Thrift may be harder because thrift calls the method and expects a > return value. We might need to write a custom Codec on Netty for thrift > support, which can potentially do callbacks (A Custom codec may be similar to > http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html > but we dont know details about it). Another option is to update thrift to > have a callback. > FYI, The motivation for this ticket is from another project which i am > working on with similar Proxy (blocking Netty transport) and making it Async > gave us 2x throughput improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
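A purely hypothetical shape for the callback-based coordination described in this thread; the interface and names are invented for illustration and are not part of any patch on this ticket.
{code}
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;

interface AsyncStorageProxySketch
{
    // Returns immediately instead of parking the Netty worker / RPC thread.
    // The future completes on the response-handling path once enough replicas
    // have acknowledged, and the transport layer writes the result back to the
    // client connection from that callback; timeouts complete it exceptionally.
    CompletableFuture<ByteBuffer> mutateAsync(ByteBuffer mutation, int consistencyLevel);
}
{code}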
[jira] [Commented] (CASSANDRA-5239) Asynchronous (non-blocking) StorageProxy
[ https://issues.apache.org/jira/browse/CASSANDRA-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14072002#comment-14072002 ] Rick Branson commented on CASSANDRA-5239: - That is an epic patch [~slebresne]. Throwing my +1 on this one. It would have profound reliability & performance impact. Not having to monopolize a thread for every running operation in a largely network and I/O bound system is big. It would really help smooth out the impact from proxy calls for temporary slowdowns on replicas. Unless I'm misunderstanding, this is the remaining synchronous piece in the network pipeline. > Asynchronous (non-blocking) StorageProxy > > > Key: CASSANDRA-5239 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5239 > Project: Cassandra > Issue Type: Improvement > Components: Core >Affects Versions: 2.0 beta 1 >Reporter: Vijay >Assignee: Sylvain Lebresne > Labels: performance > Fix For: 3.0 > > > Problem Statement: > Currently we have "rpc_min_threads, rpc_max_threads"/ > "native_transport_min_threads/native_transport_max_threads" all of the > threads in the TPE are blocking and takes resources, the threads are mostly > sleeping. Increasing the Context switch costs. > Details: > We should change StorageProxy methods to provide a callback which contains > the location where the results has to be written. When the response arrive > StorageProxy callback can write the results directly into the connection. > Timeouts can be handled in the same way. > Fixing Netty should be trivial with some refactor in the storage proxy > (currently it is one method call for sending the request and waiting) we need > callback. > Fixing Thrift may be harder because thrift calls the method and expects a > return value. We might need to write a custom Codec on Netty for thrift > support, which can potentially do callbacks (A Custom codec may be similar to > http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html > but we dont know details about it). Another option is to update thrift to > have a callback. > FYI, The motivation for this ticket is from another project which i am > working on with similar Proxy (blocking Netty transport) and making it Async > gave us 2x throughput improvement. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7594) Disruptor Thrift server worker thread pool not adjustable
[ https://issues.apache.org/jira/browse/CASSANDRA-7594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071975#comment-14071975 ] Rick Branson commented on CASSANDRA-7594: - The ideal situation would be if rpc_min_threads and rpc_max_threads still worked, at least roughly. The problem with implementing this is that the TDisruptorServer only creates fixed-size thread pools specified by numWorkersPerSelector. The variable-size thread pools are useful because they allow working around temporary congestion problems by using a large number of executor threads, but fall back to a smaller thread pool when unneeded (i.e. 128 -> 2048). An example of this is when a two replicas for the same range are simultaneously slow or unresponsive using QUORUM on RF=3. The RPC thread pool is *rapidly* consumed, which can cause fallout to other, unrelated requests. > Disruptor Thrift server worker thread pool not adjustable > - > > Key: CASSANDRA-7594 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7594 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Pavel Yaskevich > > For the THsHaDisruptorServer, there may not be enough threads to run blocking > StorageProxy methods. The current number of worker threads is hardcoded at 2 > per selector, so 2 * numAvailableProcessors(), or 64 threads on a 16-core > hyperthreaded machine. StorageProxy methods block these threads, so this puts > an upper bound on the throughput if hsha is enabled. If operations take 10ms > on average, the node can only handle a maximum of 6,400 operations per > second. This is a regression from hsha on 1.2.x, where the thread pool was > tunable using rpc_min_threads and rpc_max_threads. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7594) Disruptor Thrift server worker thread pool not adjustable
Rick Branson created CASSANDRA-7594: --- Summary: Disruptor Thrift server worker thread pool not adjustable Key: CASSANDRA-7594 URL: https://issues.apache.org/jira/browse/CASSANDRA-7594 Project: Cassandra Issue Type: Bug Reporter: Rick Branson For the THsHaDisruptorServer, there may not be enough threads to run blocking StorageProxy methods. The current number of worker threads is hardcoded at 2 per selector, so 2 * numAvailableProcessors(), or 64 threads on a 16-core hyperthreaded machine. StorageProxy methods block these threads, so this puts an upper bound on the throughput if hsha is enabled. If operations take 10ms on average, the node can only handle a maximum of 6,400 operations per second. This is a regression from hsha on 1.2.x, where the thread pool was tunable using rpc_min_threads and rpc_max_threads. -- This message was sent by Atlassian JIRA (v6.2#6252)
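A back-of-the-envelope check of the numbers in the description, written out so the ceiling is easy to reproduce; the inputs below simply restate the figures quoted in the ticket.
{code}
public class HshaThroughputEstimate
{
    public static void main(String[] args)
    {
        int availableProcessors = 32;        // 16 cores, hyperthreaded
        int workersPerSelector = 2;          // hardcoded per the report
        int workers = workersPerSelector * availableProcessors;  // 64 worker threads

        double avgOpSeconds = 0.010;         // each blocking StorageProxy call takes ~10 ms
        double maxOpsPerSecond = workers / avgOpSeconds;

        System.out.printf("workers=%d, throughput ceiling=%.0f ops/sec%n",
                          workers, maxOpsPerSecond);  // prints 6400
    }
}
{code}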
[jira] [Commented] (CASSANDRA-7402) limit the on heap memory available to requests
[ https://issues.apache.org/jira/browse/CASSANDRA-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055168#comment-14055168 ] Rick Branson commented on CASSANDRA-7402: - The more "practical" thing here is to make sure nothing on the read/write paths can potentially allocate big chunks of heap. If one looks at the read/write paths for something like PostgreSQL, they have made sure that the only potentially large allocations (sorts, hash-joins, etc) occur in a very controlled manner and have hard limits attached to them that can be configured. This seems like a reasonable thing to ask for at this stage in Cassandra's life. > limit the on heap memory available to requests > -- > > Key: CASSANDRA-7402 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7402 > Project: Cassandra > Issue Type: Improvement >Reporter: T Jake Luciani > Fix For: 3.0 > > > When running a production cluster one common operational issue is quantifying > GC pauses caused by ongoing requests. > Since different queries return varying amount of data you can easily get your > self into a situation where you Stop the world from a couple of bad actors in > the system. Or more likely the aggregate garbage generated on a single node > across all in flight requests causes a GC. > We should be able to set a limit on the max heap we can allocate to all > outstanding requests and track the garbage per requests to stop this from > happening. It should increase a single nodes availability substantially. > In the yaml this would be > {code} > total_request_memory_space_mb: 400 > {code} > It would also be nice to have either a log of queries which generate the most > garbage so operators can track this. Also a histogram. -- This message was sent by Atlassian JIRA (v6.2#6252)
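A hypothetical sketch of the kind of hard limit the ticket asks for (all names invented; this is not Cassandra code): a global byte budget that request paths reserve against before materializing large results, so one oversized query fails fast instead of dragging the heap into a GC pause.
{code}
import java.util.concurrent.atomic.AtomicLong;

final class RequestMemoryBudget
{
    private final long limitBytes;
    private final AtomicLong inUse = new AtomicLong();

    RequestMemoryBudget(long limitBytes) { this.limitBytes = limitBytes; }

    // Returns false when granting the reservation would exceed the configured
    // limit; the caller is expected to fail or shed the request.
    boolean tryReserve(long bytes)
    {
        long current;
        do
        {
            current = inUse.get();
            if (current + bytes > limitBytes)
                return false;
        }
        while (!inUse.compareAndSet(current, current + bytes));
        return true;
    }

    void release(long bytes) { inUse.addAndGet(-bytes); }
}
{code}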
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050677#comment-14050677 ] Rick Branson commented on CASSANDRA-7486: - Mad anecdotes: We ran with G1 enabled for around 4 days in a 33-node cluster running 1.2.17 on JDK7u45 that has around a 1:5 read:write ratio. We tried a few different configurations with short durations, but most of the time we ran it with the out-of-the-box G1 configuration on a 20G heap and 32 parallel GC threads (16 core, 32 hyperthreaded). There were some somewhat scary bugs fixed in 7u60 that ultimately caused me to roll back to the CMS collector after the experiment. * The experiment pointed out that our young gen was basically too small and was pulling latency up significantly. When we returned back to CMS, I doubled new size from 800M -> 1600M. We had moved to new hardware and hadn't taken the time to sit down and play with GC settings. This cut our mean latency dramatically as perceived from the client, ~50% for writes and ~30% for reads, similar to what we saw with G1. I was quite thrilled with this result. * I tried both 100ms and 150ms pause times targets with 12G, 16G, and 20G heaps, and while these resulted in slightly lower mean latency (~5-10%), Mixed GC activity caused P99s to suffer greatly. There's compelling evidence that the 200ms default is nearly ideal for the way the G1 algorithm works in its current incarnation. * We basically needed a 20G heap to make G1 work well for us, since by default G1 will use up to half of the max heap for eden space and Cassandra needs quite a large old gen to stay happy. G1 appears to need a much larger eden space to work efficiently, sizes that would make ParNew die in a fire. GCs of the eden space were impressively fast, with a ~10G eden space taking ~120ms on average to collect. * G1's huge eden space was helpful working around some issues with compaction on hints CF which had dozens of very wide partitions, hundreds of thousands of cells each. * Overall, at the default 200ms pause time target, we didn't see much of an increase in CPU usage over CMS. In the end, my tests basically told us that G1 requires a larger heap to get the same results with *far* less tuning. If there are GC issues, it seems like in the vast majority of cases G1 can either eliminate them or G1 makes it easy to just workaround them by cranking up the heap size. Someone should probably test G1 with a variable-sized heap since it's designed to give back RAM when it thinks it doesn't need it. That might or might not actually work. While we didn't test this, a configuration of G1 + heap size min of 1/8 RAM and max of 1/2 RAM might make a really nice default for Cassandra at some point. > Compare CMS and G1 pause times > -- > > Key: CASSANDRA-7486 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 > Project: Cassandra > Issue Type: Test > Components: Config >Reporter: Jonathan Ellis >Assignee: Ryan McGuire > Fix For: 2.1.0 > > > See > http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning > and https://twitter.com/rbranson/status/482113561431265281 > May want to default 2.1 to G1. > 2.1 is a different animal from 2.0 after moving most of memtables off heap. > Suspect this will help G1 even more than CMS. (NB this is off by default but > needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6621) STCS fallback is not optimal when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043976#comment-14043976 ] Rick Branson commented on CASSANDRA-6621: - As I said above, it causes bootstraps to require 2x disk space. One of the stated goals of LCS is to avoid the 2x disk space requirement, which the fallback violates. > STCS fallback is not optimal when bootstrapping > --- > > Key: CASSANDRA-6621 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6621 > Project: Cassandra > Issue Type: Improvement >Reporter: Bartłomiej Romański >Assignee: Marcus Eriksson >Priority: Minor > Labels: compaction, streaming > Fix For: 2.0.9 > > Attachments: 0001-wip-keep-sstable-level-when-bootstrapping.patch > > > The initial discussion started in (closed) CASSANDRA-5371. I've rewritten my > last comment here... > After streaming (e.g. during boostrap) Cassandra places all sstables at L0. > At the end of the process we end up with huge number of sstables at the > lowest level. > Currently, Cassandra falls back to STCS until the number of sstables at L0 > reaches the reasonable level (32 or something). > I'm not sure if falling back to STCS is the best way to handle this > particular situation. I've read the comment in the code and I'm aware why it > is a good thing to do if we have to many sstables at L0 as a result of too > many random inserts. We have a lot of sstables, each of them covers the whole > ring, there's simply no better option. > However, after the bootstrap situation looks a bit different. The loaded > sstables already have very small ranges! We just have to tidy up a bit and > everything should be OK. STCS ignores that completely and after a while we > have a bit less sstables but each of them covers the whole ring instead of > just a small part. I believe that in that case letting LCS do the job is a > better option that allowing STCS mix everything up before. > Is there a way to disable STCS fallback? I'd like to test that scenario in > practice during our next bootstrap... > Does Cassandra really have to put streamed sstables at L0? The only thing we > have to assure is that sstables at any given level do not overlap. If we > stream different regions from different nodes how can we get any overlaps? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6621) STCS fallback is not optimal when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043921#comment-14043921 ] Rick Branson commented on CASSANDRA-6621: - How about we add a switch to enable/disable this and disable it by default in 2.0 then? This behavior is a net negative for people using LCS *unless* they happen to have long "bursts" of *very* high write volume. > STCS fallback is not optimal when bootstrapping > --- > > Key: CASSANDRA-6621 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6621 > Project: Cassandra > Issue Type: Improvement >Reporter: Bartłomiej Romański >Assignee: Marcus Eriksson >Priority: Minor > Labels: compaction, streaming > Fix For: 2.0.9 > > Attachments: 0001-wip-keep-sstable-level-when-bootstrapping.patch > > > The initial discussion started in (closed) CASSANDRA-5371. I've rewritten my > last comment here... > After streaming (e.g. during boostrap) Cassandra places all sstables at L0. > At the end of the process we end up with huge number of sstables at the > lowest level. > Currently, Cassandra falls back to STCS until the number of sstables at L0 > reaches the reasonable level (32 or something). > I'm not sure if falling back to STCS is the best way to handle this > particular situation. I've read the comment in the code and I'm aware why it > is a good thing to do if we have to many sstables at L0 as a result of too > many random inserts. We have a lot of sstables, each of them covers the whole > ring, there's simply no better option. > However, after the bootstrap situation looks a bit different. The loaded > sstables already have very small ranges! We just have to tidy up a bit and > everything should be OK. STCS ignores that completely and after a while we > have a bit less sstables but each of them covers the whole ring instead of > just a small part. I believe that in that case letting LCS do the job is a > better option that allowing STCS mix everything up before. > Is there a way to disable STCS fallback? I'd like to test that scenario in > practice during our next bootstrap... > Does Cassandra really have to put streamed sstables at L0? The only thing we > have to assure is that sstables at any given level do not overlap. If we > stream different regions from different nodes how can we get any overlaps? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6621) STCS fallback is not optimal when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026916#comment-14026916 ] Rick Branson commented on CASSANDRA-6621: - This is definitely sub-optimal for us as well, just in terms of time spent compacting after a bootstrap/rebuild. We never get behind in L0 during normal operation. In addition, it causes one of the nice things about LCS to be invalidated as well, which is that you never have to worry about having double the disk space to compact. Bootstraps cause large compactions (~50% of the size of the CF), which means we need a ton of extra disk on bootstrap just to build nodes. > STCS fallback is not optimal when bootstrapping > --- > > Key: CASSANDRA-6621 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6621 > Project: Cassandra > Issue Type: Improvement >Reporter: Bartłomiej Romański >Priority: Minor > > The initial discussion started in (closed) CASSANDRA-5371. I've rewritten my > last comment here... > After streaming (e.g. during boostrap) Cassandra places all sstables at L0. > At the end of the process we end up with huge number of sstables at the > lowest level. > Currently, Cassandra falls back to STCS until the number of sstables at L0 > reaches the reasonable level (32 or something). > I'm not sure if falling back to STCS is the best way to handle this > particular situation. I've read the comment in the code and I'm aware why it > is a good thing to do if we have to many sstables at L0 as a result of too > many random inserts. We have a lot of sstables, each of them covers the whole > ring, there's simply no better option. > However, after the bootstrap situation looks a bit different. The loaded > sstables already have very small ranges! We just have to tidy up a bit and > everything should be OK. STCS ignores that completely and after a while we > have a bit less sstables but each of them covers the whole ring instead of > just a small part. I believe that in that case letting LCS do the job is a > better option that allowing STCS mix everything up before. > Is there a way to disable STCS fallback? I'd like to test that scenario in > practice during our next bootstrap... > Does Cassandra really have to put streamed sstables at L0? The only thing we > have to assure is that sstables at any given level do not overlap. If we > stream different regions from different nodes how can we get any overlaps? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7262) During streaming: java.lang.AssertionError: Reference counter -1
[ https://issues.apache.org/jira/browse/CASSANDRA-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012902#comment-14012902 ] Rick Branson commented on CASSANDRA-7262: - Also seeing this NPE: at org.apache.cassandra.streaming.StreamTransferTask.createMessageForRetry(StreamTransferTask.java:106) at org.apache.cassandra.streaming.StreamSession.retry(StreamSession.java:525) at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:401) at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) at java.lang.Thread.run(Thread.java:744) Same thing or should I open another ticket? Seems like it's in the same vein. > During streaming: java.lang.AssertionError: Reference counter -1 > > > Key: CASSANDRA-7262 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7262 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Cassandra 2.07, x86-64 Ubuntu 12.04.4, Oracle java > 1.7.0_45 >Reporter: Duncan Sands >Assignee: Joshua McKenzie >Priority: Minor > Fix For: 2.0.9 > > Attachments: 7262_v1.txt, 7262_v2.txt, system.log.gz > > > Got this assertion failure this weekend during repair: > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,332 StreamSession.java > (line 420) [Stream #3a3ac8a2-dd50-11e3-b3c1-6bf6dccd6457] Streaming error > occurred > java.lang.RuntimeException: Outgoing stream handler has been closed > at > org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:170) > at > org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:483) > at > org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:372) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) > at java.lang.Thread.run(Thread.java:744) > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,350 CassandraDaemon.java > (line 198) Exception in thread Thread[STREAM-IN-/192.168.21.14,5,RMI Runtime] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-79827-Data.db > at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) > at > org.apache.cassandra.streaming.StreamTransferTask.abort(StreamTransferTask.java:80) > at > org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:322) > at > org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:425) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:300) > at java.lang.Thread.run(Thread.java:744) > followed by a few more (the reference counter got down to -3). Got the same > kind of assertion failure on one other node (in a different data centre; > there are 21 nodes altogether distributed over 4 data centres). > I've attached the relevant part of the log. It starts quite a bit before the > assertion failure at the first exception on this node ("Cannot proceed on > repair because a neighbor ... is dead"), and finishes a few hours afterwards > when the node was restarted. 
> Edit: The following Reference counter assertion failures followed the 1st on > a different file and have a different stack trace: > ERROR [CompactionExecutor:382] 2014-05-17 01:17:53,157 CassandraDaemon.java > (line 198) Exception in thread Thread[CompactionExecutor:382,1,main] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-83888-Data.db >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReferences(SSTableReader.java:1429) >at > org.apache.cassandra.db.compaction.CompactionController.close(CompactionController.java:207) >at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:220) >at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) >at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) >at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) >at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) >at > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197) >... -- This message was sent
[jira] [Comment Edited] (CASSANDRA-7262) During streaming: java.lang.AssertionError: Reference counter -1
[ https://issues.apache.org/jira/browse/CASSANDRA-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012015#comment-14012015 ] Rick Branson edited comment on CASSANDRA-7262 at 5/29/14 3:28 AM: -- Would this cause all of the ReadStage threads to spin on decrementAndGet in org.apache.cassandra.io.sstable.SSTableReader.releaseReference? at java.util.concurrent.atomic.AtomicInteger.decrementAndGet(AtomicInteger.java:222) at org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1093) at org.apache.cassandra.io.sstable.SSTableReader.acquireReferences(SSTableReader.java:1420) at org.apache.cassandra.db.ColumnFamilyStore.markReferenced(ColumnFamilyStore.java:1458) at org.apache.cassandra.db.ColumnFamilyStore.markReferenced(ColumnFamilyStore.java:1473) at org.apache.cassandra.db.CollationController.collectTimeOrderedData(CollationController.java:69) was (Author: rbranson): Would this cause all of the ReadStage threads to spin on decrementAndGet in org.apache.cassandra.io.sstable.SSTableReader.releaseReference? > During streaming: java.lang.AssertionError: Reference counter -1 > > > Key: CASSANDRA-7262 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7262 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Cassandra 2.07, x86-64 Ubuntu 12.04.4, Oracle java > 1.7.0_45 >Reporter: Duncan Sands >Assignee: Joshua McKenzie >Priority: Minor > Fix For: 2.0.9 > > Attachments: 7262_v1.txt, 7262_v2.txt, system.log.gz > > > Got this assertion failure this weekend during repair: > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,332 StreamSession.java > (line 420) [Stream #3a3ac8a2-dd50-11e3-b3c1-6bf6dccd6457] Streaming error > occurred > java.lang.RuntimeException: Outgoing stream handler has been closed > at > org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:170) > at > org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:483) > at > org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:372) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) > at java.lang.Thread.run(Thread.java:744) > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,350 CassandraDaemon.java > (line 198) Exception in thread Thread[STREAM-IN-/192.168.21.14,5,RMI Runtime] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-79827-Data.db > at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) > at > org.apache.cassandra.streaming.StreamTransferTask.abort(StreamTransferTask.java:80) > at > org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:322) > at > org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:425) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:300) > at java.lang.Thread.run(Thread.java:744) > followed by a few more (the reference counter got down to -3). Got the same > kind of assertion failure on one other node (in a different data centre; > there are 21 nodes altogether distributed over 4 data centres). > I've attached the relevant part of the log. It starts quite a bit before the > assertion failure at the first exception on this node ("Cannot proceed on > repair because a neighbor ... 
is dead"), and finishes a few hours afterwards > when the node was restarted. > Edit: The following Reference counter assertion failures followed the 1st on > a different file and have a different stack trace: > ERROR [CompactionExecutor:382] 2014-05-17 01:17:53,157 CassandraDaemon.java > (line 198) Exception in thread Thread[CompactionExecutor:382,1,main] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-83888-Data.db >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReferences(SSTableReader.java:1429) >at > org.apache.cassandra.db.compaction.CompactionController.close(CompactionController.java:207) >at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:220) >at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) >at org.apache.cassandra.utils.WrappedRunnable.run(Wrapped
[jira] [Commented] (CASSANDRA-7262) During streaming: java.lang.AssertionError: Reference counter -1
[ https://issues.apache.org/jira/browse/CASSANDRA-7262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012015#comment-14012015 ] Rick Branson commented on CASSANDRA-7262: - Would this cause all of the ReadStage threads to spin on decrementAndGet in org.apache.cassandra.io.sstable.SSTableReader.releaseReference? > During streaming: java.lang.AssertionError: Reference counter -1 > > > Key: CASSANDRA-7262 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7262 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Cassandra 2.07, x86-64 Ubuntu 12.04.4, Oracle java > 1.7.0_45 >Reporter: Duncan Sands >Assignee: Joshua McKenzie >Priority: Minor > Fix For: 2.0.9 > > Attachments: 7262_v1.txt, 7262_v2.txt, system.log.gz > > > Got this assertion failure this weekend during repair: > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,332 StreamSession.java > (line 420) [Stream #3a3ac8a2-dd50-11e3-b3c1-6bf6dccd6457] Streaming error > occurred > java.lang.RuntimeException: Outgoing stream handler has been closed > at > org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:170) > at > org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:483) > at > org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:372) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:289) > at java.lang.Thread.run(Thread.java:744) > ERROR [STREAM-IN-/192.168.21.14] 2014-05-17 01:17:52,350 CassandraDaemon.java > (line 198) Exception in thread Thread[STREAM-IN-/192.168.21.14,5,RMI Runtime] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-79827-Data.db > at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) > at > org.apache.cassandra.streaming.StreamTransferTask.abort(StreamTransferTask.java:80) > at > org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:322) > at > org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:425) > at > org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:300) > at java.lang.Thread.run(Thread.java:744) > followed by a few more (the reference counter got down to -3). Got the same > kind of assertion failure on one other node (in a different data centre; > there are 21 nodes altogether distributed over 4 data centres). > I've attached the relevant part of the log. It starts quite a bit before the > assertion failure at the first exception on this node ("Cannot proceed on > repair because a neighbor ... is dead"), and finishes a few hours afterwards > when the node was restarted. 
> Edit: The following Reference counter assertion failures followed the 1st on > a different file and have a different stack trace: > ERROR [CompactionExecutor:382] 2014-05-17 01:17:53,157 CassandraDaemon.java > (line 198) Exception in thread Thread[CompactionExecutor:382,1,main] > java.lang.AssertionError: Reference counter -1 for > /mnt/ssd1/cassandra/data/ldn_production/historical_accounts/ldn_production-historical_accounts-jb-83888-Data.db >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReference(SSTableReader.java:1107) >at > org.apache.cassandra.io.sstable.SSTableReader.releaseReferences(SSTableReader.java:1429) >at > org.apache.cassandra.db.compaction.CompactionController.close(CompactionController.java:207) >at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:220) >at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) >at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) >at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) >at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) >at > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197) >... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-7309) Explicit limit for maximum number of Thrift connections
Rick Branson created CASSANDRA-7309: --- Summary: Explicit limit for maximum number of Thrift connections Key: CASSANDRA-7309 URL: https://issues.apache.org/jira/browse/CASSANDRA-7309 Project: Cassandra Issue Type: Improvement Reporter: Rick Branson The 'sync' Thrift server gets this by way of "rpc_max_threads" in the YAML file, but the 'hsha' server misses out. Not having a limit is dangerous because it allows badly behaved clients to use up all the file handles, preventing Cassandra from opening any files and leaving the daemon dead in the water. -- This message was sent by Atlassian JIRA (v6.2#6252)
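One simple way to impose such a cap, independent of which Thrift server implementation is in use, is to gate accepted connections on a counting semaphore and shed connections beyond the limit. A hedged sketch of the idea, not tied to the actual Thrift server classes, with an assumed limit value:
{code}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.Semaphore;

// Illustrative only: caps concurrent client connections so runaway clients
// cannot exhaust the process's file handles. The limit of 2048 is hypothetical.
final class BoundedAcceptLoop {
    private final Semaphore connectionPermits = new Semaphore(2048);

    void serve(ServerSocket listener) throws IOException {
        while (true) {
            Socket client = listener.accept();
            if (!connectionPermits.tryAcquire()) {
                client.close();               // over the limit: shed the connection immediately
                continue;
            }
            new Thread(() -> {
                try {
                    handle(client);           // hand off to whatever RPC processing applies
                } finally {
                    connectionPermits.release();
                }
            }).start();
        }
    }

    private void handle(Socket client) { /* process requests, then close the socket */ }
}
{code}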
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002142#comment-14002142 ] Rick Branson commented on CASSANDRA-6285: - Haven't been able to repro in over 5 days. We're considering the enable_reallocate_buffers.txt patch fixed and production-ready. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.8 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, > compaction_test.py, disruptor-high-cpu.patch, > disruptor-memory-corruption.patch, enable_reallocate_buffers.txt > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Reopened] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson reopened CASSANDRA-6285: - Reproduced In: 2.0.6 (was: 2.0.2) This is not fixed. Still seeing the same exception running 2.0.6. ERROR [CompactionExecutor:7] 2014-05-09 17:59:58,640 CassandraDaemon.java (line 196) Exception in thread Thread[CompactionExecutor:7,1,main] java.lang.RuntimeException: Last written key DecoratedKey(132126721345628486111245439753727165857, 0f3b67f2) >= current key DecoratedKey(37424530135488872684523334498941679307, 196b70ab) writing into /data/cassandra/data/redacted/Redacted/redacted-Redacted-tmp-jb-156533-Data.db at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:142) at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:165) at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:197) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.8 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, > compaction_test.py, disruptor-high-cpu.patch, > disruptor-memory-corruption.patch, enable_reallocate_buffers.txt > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. 
> The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move
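The "Last written key ... >= current key" failure above comes from a sanity check the sstable writer makes: rows must be appended in strictly increasing key order, so corrupted or reordered input trips the check at write time. A simplified sketch of that kind of guard (illustrative, not the actual SSTableWriter source; K stands in for DecoratedKey):
{code}
// Illustrative guard: an sstable is sorted, so each appended key must sort
// strictly after the previous one.
final class AppendOrderGuard<K extends Comparable<K>> {
    private K lastWrittenKey;

    void beforeAppend(K currentKey) {
        if (lastWrittenKey != null && lastWrittenKey.compareTo(currentKey) >= 0)
            throw new RuntimeException("Last written key " + lastWrittenKey
                                       + " >= current key " + currentKey);
        lastWrittenKey = currentKey;
    }
}
{code}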
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996899#comment-13996899 ] Rick Branson commented on CASSANDRA-6285: - I did some more digging around on our cluster that was running 2.0.6 when it saw the corruption: it took anywhere from a few hours to 48 hours for the first compaction with the out of order key exception to throw. These nodes are receiving thousands of writes per second, so it's not going to be trivially reproducible. We've been running one of the nodes with 2.0.8-tenative + enable_reallocate_buffers.txt and will report back once we've reached 72 hours and are comfortable rolling this out wide to our own clusters. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.8 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, > compaction_test.py, disruptor-high-cpu.patch, > disruptor-memory-corruption.patch, enable_reallocate_buffers.txt > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. 
> After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996712#comment-13996712 ] Rick Branson commented on CASSANDRA-6285: - I think what might help this specific quality issue out is just moving to the new HSHA implementation entirely in a later version and removing the choice. The new HSHA supposedly eliminates the performance issues that made it not a good default choice, so it appears as if there's no advantage to having the other choices. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.8 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, > compaction_test.py, disruptor-high-cpu.patch, > disruptor-memory-corruption.patch, enable_reallocate_buffers.txt > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6815) Decided if we want to bring back thrift HSHA in 2.0.7
[ https://issues.apache.org/jira/browse/CASSANDRA-6815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993814#comment-13993814 ] Rick Branson commented on CASSANDRA-6815: - FYI I reopened CASSANDRA-6285 because the "fix" doesn't fix the issue at all. > Decided if we want to bring back thrift HSHA in 2.0.7 > - > > Key: CASSANDRA-6815 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6815 > Project: Cassandra > Issue Type: Improvement >Reporter: Sylvain Lebresne >Assignee: Pavel Yaskevich > > This is the followup of CASSANDRA-6285, to decide what we want to do > regarding thrift servers moving forward. My reading of CASSANDRA-6285 > suggests that the possible options includes: > # bring back the old HSHA implementation from 1.2 as "hsha" and make the > disruptor implementation be "disruptor_hsha". > # use the new TThreadedSelectorServer from thrift as "hsha", making the > disruptor implementation "disruptor_hsha" as above > # just wait for Pavel to fix the disruptor implementation for off-heap > buffers to switch back to that, keeping on-heap buffer until then. > # keep on-heap buffer for the disruptor implementation and do nothing > particular. > I could be missing some options and we can probably do some mix of those. I > don't have a particular opinion to offer on the matter. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6996) Setting severity via JMX broken
[ https://issues.apache.org/jira/browse/CASSANDRA-6996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977068#comment-13977068 ] Rick Branson commented on CASSANDRA-6996: - LGTM, works on a test box. > Setting severity via JMX broken > --- > > Key: CASSANDRA-6996 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6996 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Rick Branson >Assignee: Vijay >Priority: Minor > Fix For: 2.0.8 > > Attachments: 0001-CASSANDRA-6996.patch > > > Looks like setting the Severity attribute in the DynamicEndpointSnitch via > JMX is a no-op. -- This message was sent by Atlassian JIRA (v6.2#6252)
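For anyone verifying the same thing, setting the attribute from a standalone JMX client looks roughly like the sketch below. The MBean ObjectName is written from memory and should be treated as an assumption; adjust the host, port, and severity value to your environment.
{code}
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hedged sketch: connects to a node's JMX port and sets the snitch Severity attribute.
// The ObjectName below is assumed, not verified against this exact Cassandra version.
public class SetSeverity {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName snitch = new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch");
            mbs.setAttribute(snitch, new Attribute("Severity", 2.0d));
            System.out.println("Severity now: " + mbs.getAttribute(snitch, "Severity"));
        }
    }
}
{code}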
[jira] [Commented] (CASSANDRA-6757) SliceQueryFilter can accept multiple ColumnSlice but not exposed to thrift
[ https://issues.apache.org/jira/browse/CASSANDRA-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964961#comment-13964961 ] Rick Branson commented on CASSANDRA-6757: - *begs for 2.1 version of this* > SliceQueryFilter can accept multiple ColumnSlice but not exposed to thrift > -- > > Key: CASSANDRA-6757 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6757 > Project: Cassandra > Issue Type: New Feature >Reporter: Edward Capriolo >Assignee: Edward Capriolo > Fix For: 3.0 > > Attachments: Cassandra-6757.1.diff > > > Thrift has getSlice() which takes a key and a predicate. SliceQueryFilter can > accept multiple ColumnSlice(s). Users should have access to this capability > already found in Cassandra. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6961) nodes should go into hibernate when join_ring is false
[ https://issues.apache.org/jira/browse/CASSANDRA-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964626#comment-13964626 ] Rick Branson commented on CASSANDRA-6961: - Very happy about this. It will make a ton of things easier from an operations perspective (bringing up new DCs, bringing up hosts after long-ish maintenance), but I'm also interested in using this to potentially have dedicated coordinator nodes that are separate from storage. We find ourselves CPU-bound on our more capacity-constrained and expensive storage-class hardware. Most of this CPU time is spent on request coordination. Moving this work to cheap "diskless" application-class hardware is far preferable and will allow us to maximize the capacity of our storage nodes. > nodes should go into hibernate when join_ring is false > -- > > Key: CASSANDRA-6961 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6961 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Brandon Williams >Assignee: Brandon Williams > Fix For: 2.0.7, 2.1 beta2 > > Attachments: 6961.txt > > > The impetus here is this: a node that was down for some period and comes back > can serve stale information. We know from CASSANDRA-768 that we can't just > wait for hints, and know that tangentially related CASSANDRA-3569 prevents us > from having the node in a down (from the FD's POV) state handle streaming. > We can *almost* set join_ring to false, then repair, and then join the ring > to narrow the window (actually, you can do this and everything succeeds > because the node doesn't know it's a member yet, which is probably a bit of a > bug.) If instead we modified this to put the node in hibernate, like > replace_address does, it could work almost like replace, except you could run > a repair (manually) while in the hibernate state, and then flip to normal > when it's done. > This won't prevent the staleness 100%, but it will greatly reduce the chance > if the node has been down a significant amount of time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-6996) Setting severity via JMX broken
Rick Branson created CASSANDRA-6996: --- Summary: Setting severity via JMX broken Key: CASSANDRA-6996 URL: https://issues.apache.org/jira/browse/CASSANDRA-6996 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Priority: Minor Looks like setting the Severity attribute in the DynamicEndpointSnitch via JMX is a no-op. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (CASSANDRA-6992) Bootstrap on vnodes clusters can cause stampeding/storm behavior
Rick Branson created CASSANDRA-6992: --- Summary: Bootstrap on vnodes clusters can cause stampeding/storm behavior Key: CASSANDRA-6992 URL: https://issues.apache.org/jira/browse/CASSANDRA-6992 Project: Cassandra Issue Type: Bug Environment: Various vnodes-enabled clusters in EC2, m1.xlarge and hi1.4xlarge, ~3000-8000 tokens. Reporter: Rick Branson Assuming this is an issue with vnodes clusters because SSTableReader#getPositionsForRanges is more expensive to compute with 256x the ranges, but could be wrong. On even well-provisioned hosts, this can cause a severe spike in network throughput & CPU utilization from a storm of flushes, which impacts long-tail times pretty badly. On weaker hosts (like m1.xlarge with ~500GB of data), it can result in minutes of churn while the node gets through StreamOut#createPendingFiles. This *might* be better in 2.0, but it's probably still reproducible because the bootstrapping node sends out all of its streaming requests at once. I'm thinking that this could be staggered at the bootstrapping node to avoid the simultaneous spike across the whole cluster. Not sure how to stagger it besides something very naive like one-at-a-time with a pause. Maybe this should also be throttled in StreamOut#createPendingFiles on the out-streaming host? Any thoughts? From the stack dump of one of our weaker nodes that was struggling for a few minutes just starting the StreamOut:
{code}
"MiscStage:1" daemon prio=10 tid=0x0292f000 nid=0x688 runnable [0x7f7b03df6000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:361)
        at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371)
        at org.apache.cassandra.io.sstable.IndexHelper$IndexInfo.deserialize(IndexHelper.java:187)
        at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:125)
        at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:889)
        at org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:790)
        at org.apache.cassandra.io.sstable.SSTableReader.getPositionsForRanges(SSTableReader.java:730)
        at org.apache.cassandra.streaming.StreamOut.createPendingFiles(StreamOut.java:172)
        at org.apache.cassandra.streaming.StreamOut.transferSSTables(StreamOut.java:157)
        at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:148)
        at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:116)
        at org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:44)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
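A naive version of the "one-at-a-time with a pause" staggering mentioned in that report might look like the sketch below; the request type, send method, and pause length are placeholders, not the real streaming API.
{code}
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative only: issue stream requests to source nodes sequentially with a
// pause, instead of firing them all at once. 'StreamRequest' and 'send' are
// hypothetical stand-ins for the real streaming classes.
final class StaggeredBootstrap {
    private static final long PAUSE_BETWEEN_REQUESTS_SECONDS = 30; // assumed value

    static void requestRanges(List<StreamRequest> requests) throws InterruptedException {
        for (StreamRequest request : requests) {
            send(request);                                    // ask one source node to start streaming
            TimeUnit.SECONDS.sleep(PAUSE_BETWEEN_REQUESTS_SECONDS);
        }
    }

    static void send(StreamRequest request) { /* transmit the request to its source node */ }

    static final class StreamRequest { /* source endpoint + requested ranges elided */ }
}
{code}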
[jira] [Commented] (CASSANDRA-6685) Nodes never bootstrap if schema is empty
[ https://issues.apache.org/jira/browse/CASSANDRA-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944211#comment-13944211 ] Rick Branson commented on CASSANDRA-6685: - Ran into this today trying to test out some new machines. It's not good that we've just been sitting on this for 6 weeks now. > Nodes never bootstrap if schema is empty > > > Key: CASSANDRA-6685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6685 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Richard Low >Assignee: Brandon Williams > Fix For: 1.2.16 > > > Since 1.2.15, bootstrap never completes if the schema is empty. The > bootstrapping node endlessly prints: > bq. {{INFO 12:37:44,863 JOINING: waiting for schema information to complete}} > until you add something to the schema (i.e. create a keyspace). > The problem looks to be caused by CASSANDRA-6648, where > MigrationManager.isReadForBootstrap() was changed to: > bq. {{return Schema.instance.getVersion() != null && > !Schema.emptyVersion.equals(Schema.instance.getVersion());}} > This is wrong since > {{Schema.emptyVersion.equals(Schema.instance.getVersion())}} is always true > if there is no schema. > We need some different logic for determining when the schema is propagated. > I haven't tested, but I expect this issue appears in 2.0.5 too. -- This message was sent by Atlassian JIRA (v6.2#6252)
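The quoted check can never become true on a cluster whose schema is legitimately empty, so the joining node waits forever. Purely as an illustration of the intent (not the actual fix that shipped), the readiness test wants to compare the local schema version against what live peers are reporting rather than against the empty sentinel alone:
{code}
import java.util.Collection;
import java.util.UUID;

// Illustrative sketch only. The real MigrationManager logic differs; this just
// expresses "we are ready once our schema version agrees with some live peer",
// which also holds when the whole cluster genuinely has an empty schema.
final class SchemaReadiness {
    static boolean isReadyForBootstrap(UUID localVersion, Collection<UUID> livePeerVersions) {
        if (localVersion == null)
            return false;                          // no schema state at all yet
        for (UUID peerVersion : livePeerVersions)
            if (localVersion.equals(peerVersion))
                return true;                       // we agree with at least one live peer
        return false;
    }
}
{code}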
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923514#comment-13923514 ] Rick Branson commented on CASSANDRA-6285: - [~xedin]: off-heap for disruptor. I think that we should really consider bringing back the old HSHA implementation from 1.2 as the "hsha" and allow switching to the disruptor implementation as another rpc_server_type for those that want to try it out. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.6 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, compaction_test.py > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923466#comment-13923466 ] Rick Branson commented on CASSANDRA-6285: - [~xedin]: We had perf issues with the disruptor as well (sudden spikes of CPU to 100%) + this so I just wanted to get production away from it ASAP. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.6 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, compaction_test.py > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923441#comment-13923441 ] Rick Branson commented on CASSANDRA-6285: - We put in the TThreadedSelectorServer patch from Marcus. On top of that, to get our response times down from the 10x what they should be, I rolled out a larger hard-coded selector thread pool size of 256 (instead of the # of processors -- a measly 16). This is shaping up nicely. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.6 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, compaction_test.py > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
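For reference, the selector pool size on Thrift's TThreadedSelectorServer is settable through its Args builder, so the hard-coded 256 described above corresponds roughly to the sketch below. It assumes a Thrift 0.9.x-era API; transport and processor construction are elided, and the worker-pool size is illustrative rather than a recommendation.
{code}
import org.apache.thrift.TProcessor;
import org.apache.thrift.server.TThreadedSelectorServer;
import org.apache.thrift.transport.TNonblockingServerSocket;
import org.apache.thrift.transport.TTransportException;

// Rough sketch of sizing the selector pool explicitly rather than defaulting to
// the processor count.
final class SelectorPoolSizing {
    static TThreadedSelectorServer build(TProcessor processor) throws TTransportException {
        TNonblockingServerSocket socket = new TNonblockingServerSocket(9160);
        TThreadedSelectorServer.Args args = new TThreadedSelectorServer.Args(socket)
            .selectorThreads(256)   // value from the comment above, instead of # of cores
            .workerThreads(64)      // assumed worker pool size, purely illustrative
            .processor(processor);
        return new TThreadedSelectorServer(args);
    }
}
{code}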
[jira] [Commented] (CASSANDRA-6285) 2.0 HSHA server introduces corrupt data
[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923245#comment-13923245 ] Rick Branson commented on CASSANDRA-6285: - Unfortunately this hit us during our 2.0.5 upgrade and 'sync' is not an option for the # of connections we have per node (tried this). We've been running Marcus' patch in prod and limping along on it, but it looks like the requestInvoke() override is causing the requests to get executed on the selector pool (which is limited by CPU #) instead of the executor service so our response times are pretty bad. The lack of anything showing up in the JMX for the executor service definitely points towards this. > 2.0 HSHA server introduces corrupt data > --- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 >Reporter: David Sauer >Assignee: Pavel Yaskevich >Priority: Critical > Fix For: 2.0.6 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, compaction_test.py > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
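The distinction described in that comment is roughly the one sketched below: if the invoke hook runs the request inline, the work lands on the selector threads (sized to the CPU count), whereas handing it to a worker ExecutorService keeps the selectors free for I/O. This is an illustration of the pattern, not the actual patch or Thrift source.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Illustrative: the difference between invoking on the selector thread and
// dispatching to a worker pool. The Runnable stands in for the framed request.
final class InvokeDispatch {
    private final ExecutorService workerPool;

    InvokeDispatch(ExecutorService workerPool) { this.workerPool = workerPool; }

    // Runs on the selector thread: keeps the selector busy for the whole request.
    boolean invokeInline(Runnable invocation) {
        invocation.run();
        return true;
    }

    // Hands the request to the worker pool so the selector can go back to I/O.
    boolean invokeOnWorkerPool(Runnable invocation) {
        try {
            workerPool.execute(invocation);
            return true;
        } catch (RejectedExecutionException e) {
            return false;   // pool saturated or shut down; caller decides how to fail
        }
    }
}
{code}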
[jira] [Updated] (CASSANDRA-6592) IllegalArgumentException when Preparing Statements
[ https://issues.apache.org/jira/browse/CASSANDRA-6592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6592: Priority: Critical (was: Minor) > IllegalArgumentException when Preparing Statements > -- > > Key: CASSANDRA-6592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6592 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Tyler Hobbs >Assignee: Lyuben Todorov >Priority: Critical > Fix For: 1.2.14, 2.0.5 > > > When preparing a lot of statements with the python native driver, I > occasionally get an error response with an error that corresponds to the > following stacktrace in the cassandra logs: > {noformat} > ERROR [Native-Transport-Requests:126] 2014-01-11 13:58:05,503 > ErrorMessage.java (line 210) Unexpected exception during request > java.lang.IllegalArgumentException > at > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap.checkArgument(ConcurrentLinkedHashMap.java:259) > at > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$BoundedEntryWeigher.weightOf(ConcurrentLinkedHashMap.java:1448) > at > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap.put(ConcurrentLinkedHashMap.java:764) > at > com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap.put(ConcurrentLinkedHashMap.java:743) > at > org.apache.cassandra.cql3.QueryProcessor.storePreparedStatement(QueryProcessor.java:255) > at > org.apache.cassandra.cql3.QueryProcessor.prepare(QueryProcessor.java:221) > at > org.apache.cassandra.transport.messages.PrepareMessage.execute(PrepareMessage.java:77) > at > org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287) > at > org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at > org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43) > at > org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > {noformat} > Looking at the CLHM source, this means we're giving the statement a weight > that's less than 1. I'll also note that these errors frequently happen in > clumps of 2 or 3 at a time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
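The stack trace points at a weigher returning a value below 1 for some prepared statements. A defensive weigher that clamps to the minimum the map accepts would look roughly like this; the size-measuring call is a placeholder, not the actual QueryProcessor code.
{code}
import com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap;
import com.googlecode.concurrentlinkedhashmap.Weigher;

// Sketch of a weigher that can never violate the map's weight >= 1 requirement.
// 'measuredSize(...)' is a hypothetical stand-in for however statement size is estimated.
final class ClampedWeigherExample {
    static ConcurrentLinkedHashMap<String, Object> build(long capacityInBytes) {
        Weigher<Object> clamped = value -> Math.max(1, measuredSize(value));
        return new ConcurrentLinkedHashMap.Builder<String, Object>()
            .maximumWeightedCapacity(capacityInBytes)
            .weigher(clamped)
            .build();
    }

    static int measuredSize(Object statement) {
        return 0;   // placeholder: a real estimate could legitimately round down to 0
    }
}
{code}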
[jira] [Commented] (CASSANDRA-6569) Batchlog replays copy the entire batchlog table into the heap
[ https://issues.apache.org/jira/browse/CASSANDRA-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869620#comment-13869620 ] Rick Branson commented on CASSANDRA-6569: - Also, would it be possible to get some INFO-level logging (like HH has) on batchlog replay while we're here? > Batchlog replays copy the entire batchlog table into the heap > - > > Key: CASSANDRA-6569 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6569 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > Fix For: 1.2.14, 2.0.5 > > Attachments: 6569.txt > > > The current batchlog replay path will read the entire batchlog table into the > heap. This is pretty bad. This was compounded by CASSANDRA-5762, which caused > the SELECT statement used by the batchlog replay to bring the entire row into > memory instead of just the selected columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (CASSANDRA-6569) Batchlog replays copy the entire batchlog table into the heap
[ https://issues.apache.org/jira/browse/CASSANDRA-6569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869618#comment-13869618 ] Rick Branson commented on CASSANDRA-6569: - Looks like deleteBatch got switched off the CQL code-path. This seems inconsistent with the rest of the replay logic. > Batchlog replays copy the entire batchlog table into the heap > - > > Key: CASSANDRA-6569 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6569 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > Fix For: 1.2.14, 2.0.5 > > Attachments: 6569.txt > > > The current batchlog replay path will read the entire batchlog table into the > heap. This is pretty bad. This was compounded by CASSANDRA-5762, which caused > the SELECT statement used by the batchlog replay to bring the entire row into > memory instead of just the selected columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (CASSANDRA-6569) Batchlog replays copy the entire batchlog table into the heap
Rick Branson created CASSANDRA-6569: --- Summary: Batchlog replays copy the entire batchlog table into the heap Key: CASSANDRA-6569 URL: https://issues.apache.org/jira/browse/CASSANDRA-6569 Project: Cassandra Issue Type: Bug Components: Core Reporter: Rick Branson Assignee: Aleksey Yeschenko The current batchlog replay path will read the entire batchlog table into the heap. This is pretty bad. This was compounded by CASSANDRA-5762, which caused the SELECT statement used by the batchlog replay to bring the entire row into memory instead of just the selected columns. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
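The heap problem described above is the classic "load it all, then iterate" shape; replaying in bounded pages keeps memory flat regardless of how large the batchlog has grown. A generic sketch of the paging pattern follows; the fetch/replay calls and page size are placeholders, not the real replay code.
{code}
import java.util.Collections;
import java.util.List;

// Illustrative paging loop: process the batchlog a bounded page at a time instead
// of materializing the whole table. 'fetchPage' and 'replay' are hypothetical.
final class PagedBatchlogReplay {
    private static final int PAGE_SIZE = 128;   // assumed page size

    static void replayAll() {
        Object position = null;                 // opaque paging token
        while (true) {
            List<Object> page = fetchPage(position, PAGE_SIZE);
            if (page.isEmpty())
                break;
            for (Object batch : page)
                replay(batch);                  // replay and delete one batch at a time
            position = page.get(page.size() - 1);
        }
    }

    static List<Object> fetchPage(Object afterPosition, int limit) { return Collections.emptyList(); }
    static void replay(Object batch) { }
}
{code}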
[jira] [Updated] (CASSANDRA-6551) Rack-aware batchlog replication
[ https://issues.apache.org/jira/browse/CASSANDRA-6551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6551: Component/s: Core Reproduced In: 2.0.4, 1.2.13 Fix Version/s: 2.0.5 1.2.14 > Rack-aware batchlog replication > --- > > Key: CASSANDRA-6551 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6551 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > Fix For: 1.2.14, 2.0.5 > > > Right now the batchlog replication code just randomly picks 2 other nodes in > the same DC, regardless of rack. Ideally we'd pick 2 replicas in other racks > to achieve higher fault tolerance. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (CASSANDRA-6551) Rack-aware batchlog replication
Rick Branson created CASSANDRA-6551: --- Summary: Rack-aware batchlog replication Key: CASSANDRA-6551 URL: https://issues.apache.org/jira/browse/CASSANDRA-6551 Project: Cassandra Issue Type: Improvement Reporter: Rick Branson Assignee: Aleksey Yeschenko Right now the batchlog replication code just randomly picks 2 other nodes in the same DC, regardless of rack. Ideally we'd pick 2 replicas in other racks to achieve higher fault tolerance. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
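To make the proposed behavior concrete, here is a minimal sketch of rack-aware selection of two batchlog endpoints. It is illustrative only: the class, method, and parameter names are hypothetical, and the real selection logic lives in StorageProxy.getBatchlogEndpoints and differs in detail.
{code}
import java.net.InetAddress;
import java.util.*;

// Hedged sketch: prefer batchlog replicas in racks other than the coordinator's,
// falling back to the local rack only when there aren't enough candidates.
// Names are hypothetical; this is not the actual Cassandra implementation.
public class RackAwareBatchlogPicker
{
    public static List<InetAddress> pick(String localRack, Map<InetAddress, String> rackByEndpoint)
    {
        List<InetAddress> otherRacks = new ArrayList<InetAddress>();
        List<InetAddress> sameRack = new ArrayList<InetAddress>();
        for (Map.Entry<InetAddress, String> e : rackByEndpoint.entrySet())
        {
            if (localRack.equals(e.getValue()))
                sameRack.add(e.getKey());
            else
                otherRacks.add(e.getKey());
        }
        Collections.shuffle(otherRacks);
        Collections.shuffle(sameRack);

        List<InetAddress> picked = new ArrayList<InetAddress>(2);
        for (InetAddress ep : otherRacks)
            if (picked.size() < 2)
                picked.add(ep);
        for (InetAddress ep : sameRack)
            if (picked.size() < 2)
                picked.add(ep);
        return picked;
    }
}
{code}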
[jira] [Commented] (CASSANDRA-6485) NPE in calculateNaturalEndpoints
[ https://issues.apache.org/jira/browse/CASSANDRA-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848118#comment-13848118 ] Rick Branson commented on CASSANDRA-6485: - LGTM. > NPE in calculateNaturalEndpoints > > > Key: CASSANDRA-6485 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6485 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Russell Alexander Spitzer >Assignee: Jonathan Ellis > Fix For: 1.2.13 > > Attachments: 6485.txt > > > I was running a test where I added a new data center to an existing cluster. > Test outline: > Start 25 Node DC1 > Keyspace Setup Replication 3 > Begin insert against DC1 Using Stress > While the inserts are occuring > Start up 25 Node DC2 > Alter Keyspace to include Replication in 2nd DC > Run rebuild on DC2 > Wait for stress to finish > Run repair on Cluster > ... Some other operations > Although there are no issues with smaller clusters or clusters without > vnodes, Larger setups with vnodes seem to consistently see the following > exception in the logs as well as a write operation failing for each > exception. Usually this happens between 1-8 times during an experiment. > The exceptions/failures are Occurring when DC2 is brought online but *before* > any alteration of the Keyspace. All of the exceptions are happening on DC1 > nodes. One of the exceptions occurred on a seed node though this doesn't seem > to be the case most of the time. > While the test was running, nodetool was run every second to get cluster > status. At no time did any nodes report themselves as down. > {code} > ystem_logs-107.21.186.208/system.log-ERROR [Thrift:1] 2013-12-13 06:19:52,647 > CustomTThreadPoolServer.java (line 217) Error occurred during processing of > message. 
> system_logs-107.21.186.208/system.log:java.lang.NullPointerException > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:128) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2624) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:190) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:866) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:849) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:749) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3690) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.getResult(Cassandra.java:3678) > system_logs-107.21.186.208/system.log-at > org.apache.thrift.ProcessFunction.process(ProcessFunction.java:32) > system_logs-107.21.186.208/system.log-at > org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:34) > system_logs-107.21.186.208/system.log-at > org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199) > system_logs-107.21.186.208/system.log-at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > system_logs-107.21.186.208/system.log-at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > system_logs-107.21.186.208/system.log-at > java.lang.Thread.run(Thread.java:724) > {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (CASSANDRA-6488) Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters
Rick Branson created CASSANDRA-6488: --- Summary: Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters Key: CASSANDRA-6488 URL: https://issues.apache.org/jira/browse/CASSANDRA-6488 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Assignee: Aleksey Yeschenko Attachments: 6488-rbranson-patch.txt, graph (21).png The cloneOnlyTokenMap call in StorageProxy.getBatchlogEndpoints causes enormous amounts of CPU to be consumed on clusters with many vnodes. I created a patch to cache this data as a workaround and deployed it to a production cluster with 15,000 tokens. CPU consumption dropped to 1/5th. This highlights the overall issues with cloneOnlyTokenMap() calls on vnodes clusters. I'm including the maybe-not-the-best-quality workaround patch to use as a reference, but cloneOnlyTokenMap is a systemic issue and every place it's called should probably be investigated. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
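For readers following along, a rough sketch of the memoization idea behind the workaround (this is not the attached 6488-rbranson-patch.txt; the class and method names below are invented for illustration):
{code}
import org.apache.cassandra.locator.TokenMetadata;

// Hedged sketch: cache the expensive cloneOnlyTokenMap() result so batchlog
// endpoint calculation doesn't re-clone the whole token map on every write.
// Structure and names are illustrative, not the actual patch.
public class CachedTokenMapHolder
{
    private final TokenMetadata source;
    private volatile TokenMetadata cachedClone;

    public CachedTokenMapHolder(TokenMetadata source)
    {
        this.source = source;
    }

    public TokenMetadata get()
    {
        TokenMetadata snapshot = cachedClone;
        if (snapshot == null)
        {
            synchronized (this)
            {
                if (cachedClone == null)
                    cachedClone = source.cloneOnlyTokenMap(); // the expensive call on vnode rings
                snapshot = cachedClone;
            }
        }
        return snapshot;
    }

    // Hook this into whatever invalidation path is appropriate (ring changes, etc.)
    public void invalidate()
    {
        cachedClone = null;
    }
}
{code}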
[jira] [Updated] (CASSANDRA-6488) Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters
[ https://issues.apache.org/jira/browse/CASSANDRA-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6488: Attachment: 6488-rbranson-patch.txt > Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters > - > > Key: CASSANDRA-6488 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6488 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > Attachments: 6488-rbranson-patch.txt, graph (21).png > > > The cloneTokenOnlyMap call in StorageProxy.getBatchlogEndpoints causes > enormous amounts of CPU to be consumed on clusters with many vnodes. I > created a patch to cache this data as a workaround and deployed it to a > production cluster with 15,000 tokens. CPU consumption drop to 1/5th. This > highlights the overall issues with cloneOnlyTokenMap() calls on vnodes > clusters. I'm including the maybe-not-the-best-quality workaround patch to > use as a reference, but cloneOnlyTokenMap is a systemic issue and every place > it's called should probably be investigated. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Updated] (CASSANDRA-6488) Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters
[ https://issues.apache.org/jira/browse/CASSANDRA-6488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6488: Attachment: graph (21).png CPU usage dropping on a production cluster after the attached patch is rolled out. > Batchlog writes consume unnecessarily large amounts of CPU on vnodes clusters > - > > Key: CASSANDRA-6488 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6488 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Aleksey Yeschenko > Attachments: 6488-rbranson-patch.txt, graph (21).png > > > The cloneTokenOnlyMap call in StorageProxy.getBatchlogEndpoints causes > enormous amounts of CPU to be consumed on clusters with many vnodes. I > created a patch to cache this data as a workaround and deployed it to a > production cluster with 15,000 tokens. CPU consumption drop to 1/5th. This > highlights the overall issues with cloneOnlyTokenMap() calls on vnodes > clusters. I'm including the maybe-not-the-best-quality workaround patch to > use as a reference, but cloneOnlyTokenMap is a systemic issue and every place > it's called should probably be investigated. -- This message was sent by Atlassian JIRA (v6.1.4#6159)
[jira] [Created] (CASSANDRA-6408) Efficient multi-partition mutations
Rick Branson created CASSANDRA-6408: --- Summary: Efficient multi-partition mutations Key: CASSANDRA-6408 URL: https://issues.apache.org/jira/browse/CASSANDRA-6408 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Rick Branson At the SF Summit this year, Sylvain suggested that C* drops a very large amount of write throughput on the floor for multi-partition mutations because they are broken into RowMutations and executed individually. Stress tests that I've run show 10X the throughput for 1-row x 1000-col writes versus 1000-row x 1-col writes. We have a core high-write-skew use case involving fan-out-on-write against hundreds to thousands of keys at a time; it's currently implemented in Redis, which doesn't seem to suffer from this issue. Would love to be able to move this to C* at some point. This is likely a pretty large undertaking as it would require touching a large portion of the write path, but I figure I'd put it here for comment and/or debate at this point. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833150#comment-13833150 ] Rick Branson commented on CASSANDRA-6345: - LGTM! > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.13, 2.0.4 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345-v3.txt, 6345-v4.txt, 6345-v5.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-5357) Query cache
[ https://issues.apache.org/jira/browse/CASSANDRA-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832901#comment-13832901 ] Rick Branson commented on CASSANDRA-5357: - Perhaps an anecdote from a production system might help find a simple, yet useful improvement to the row cache. Facebook's TAO distributed storage system supports a data model called "assocs" which are basically just graph edges, and nodes assigned to a given assoc ID hold a write-through cache of the state. The assoc storage can be roughly considered a more use-case specific CF. For large assocs with many thousands of edges, TAO only maintains the tail of the assoc in memory, as those tend to be the most "interesting" portions of data. More of the details are discussed in the linked paper[1]. Perhaps instead of a total overhaul, what's really needed is to evolve the row cache by modifying it to cache only the head of the row and its bounds. In contrast to the complexity of trying to match queries & mutations to a set of serialized query filter objects, the cache only needs to maintain at most one interval for each row. This would provide a very simple write-through story. After reviewing our production wide row use cases, they seem to fall into two camps. The first and most read-performance sensitive is vastly skewed towards reads on the head of the row (>90% of the time) with a fixed limit. The second is randomly distributed slice queries which would not seem to provide a very good cache hit rate either way. [1] https://www.usenix.org/conference/atc13/technical-sessions/papers/bronson > Query cache > --- > > Key: CASSANDRA-5357 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5357 > Project: Cassandra > Issue Type: Bug >Reporter: Jonathan Ellis >Assignee: Vijay > > I think that most people expect the row cache to act like a query cache, > because that's a reasonable model. Caching the entire partition is, in > retrospect, not really reasonable, so it's not surprising that it catches > people off guard, especially given the confusion we've inflicted on ourselves > as to what a "row" constitutes. > I propose replacing it with a true query cache. -- This message was sent by Atlassian JIRA (v6.1#6144)
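As a thought experiment, a minimal sketch of what a head-of-row cache entry with write-through on a single interval could look like. Everything here (types, names, the String column keys) is hypothetical and simplified; it only illustrates the "cache the head of the row and its bounds" idea, not a proposed Cassandra patch.
{code}
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hedged sketch: cache only the first maxColumns columns of a row, plus the
// upper bound of the cached interval. Writes past the bound are ignored;
// head-limited reads can be served from cache, anything else is a miss.
public class HeadOfRowCacheEntry
{
    private final int maxColumns;
    private final NavigableMap<String, byte[]> head = new TreeMap<String, byte[]>();
    private String upperBound; // null means the whole row currently fits in the cache

    public HeadOfRowCacheEntry(int maxColumns)
    {
        this.maxColumns = maxColumns;
    }

    // Write-through: only writes that land inside the cached interval are applied.
    public synchronized void applyWrite(String column, byte[] value)
    {
        if (upperBound != null && column.compareTo(upperBound) >= 0)
            return; // lands past the cached head, nothing to do
        head.put(column, value);
        while (head.size() > maxColumns)
        {
            String evicted = head.lastKey();
            head.remove(evicted);
            upperBound = evicted; // the cached interval now ends before the evicted column
        }
    }

    // Returns the first 'limit' columns, or null on a miss.
    public synchronized NavigableMap<String, byte[]> readHead(int limit)
    {
        if (head.size() >= limit)
        {
            NavigableMap<String, byte[]> result = new TreeMap<String, byte[]>();
            for (Map.Entry<String, byte[]> e : head.entrySet())
            {
                if (result.size() == limit)
                    break;
                result.put(e.getKey(), e.getValue());
            }
            return result;
        }
        // Fewer cached columns than requested: only a hit if the whole row is cached.
        return upperBound == null ? new TreeMap<String, byte[]>(head) : null;
    }
}
{code}
A real implementation would key on the table's comparator rather than Strings and would have to handle deletions, but the single-interval bookkeeping is the core of the proposed simplification.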
[jira] [Comment Edited] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832783#comment-13832783 ] Rick Branson edited comment on CASSANDRA-6345 at 11/26/13 5:34 PM: --- Thanks for taking the time to explain the consistency story. It makes perfect sense. My defensiveness comment suggested bumping the version number each time the TM write lock is released, which would be in addition to the existing invalidations. You're probably a much better gauge on the usefulness of this, so up to you. Really nice that the v5 patch is so compact. Two minor comments: the endpointsLock declaration is still in there, and not to be all nitpicky but there are two typos in the comments ("wo we keep" and "clone got invalidted"). was (Author: rbranson): Thanks for taking the time to explain the consistency story. It makes perfect sense. My defensiveness comment suggested bumping the version number (this is practically free) each time the TM write lock is released, which would be in addition to the existing invalidations. You're probably a much better gauge on the usefulness of this, so up to you. Really nice that the v5 patch is so compact. Two minor comments: the endpointsLock declaration is still in there, and not to be all nitpicky but there are two typos in the comments ("wo we keep" and "clone got invalidted"). > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.13 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345-v3.txt, 6345-v4.txt, 6345-v5.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. 
> A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832783#comment-13832783 ] Rick Branson commented on CASSANDRA-6345: - Thanks for taking the time to explain the consistency story. It makes perfect sense. My defensiveness comment suggested bumping the version number (this is practically free) each time the TM write lock is released, which would be in addition to the existing invalidations. You're probably a much better gauge on the usefulness of this, so up to you. Really nice that the v5 patch is so compact. Two minor comments: the endpointsLock declaration is still in there, and not to be all nitpicky but there are two typos in the comments ("wo we keep" and "clone got invalidted"). > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.13 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345-v3.txt, 6345-v4.txt, 6345-v5.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13828097#comment-13828097 ] Rick Branson commented on CASSANDRA-6345: - +100 at removing those pub/sub callbacks :) The concurrency issues I bring up are probably because I'm unfamiliar with the "guarantees" needed by TokenMetadata updates. It looks like the current release code is subject to the issue I brought up, where method calls on TokenMetadata that change state return successfully before all threads applying mutations have "seen" the update. There will be some mutations in progress that are using "stale" token data to apply writes even after TokenMetadata write methods returns as successful. So this does not appear to be a regression, but I'm just being overly cautious having been burned by these sort of double-caching scenarios before. You bring up the point that over-broad operations are ok, and I agree, but I'm more concerned about operations that are too narrow. It seems that unless I'm missing something either is possible with the current release code, and thus these patches as well (including mine). TokenMetadata#updateNormalTokens is (implicitly) relying on the removeFromMoving call to bump the version, but the tokenToEndpointMap is updated afterwards, which means internal data is updated after the version is bumped. IMHO to be defensive, any time the write lock is acquired in TokenMetadata, the version should be bumped in the finally block before the lock is released. I don't think this is exposing a bug in the existing patch though, because cloneOnlyTokenMap will be blocked until the write lock is released in the finally block. Is the idea with the striped lock on the endpoint cache in AbstractReplicationStrategy to help smooth out the stampede effect when the "global" lock on the cached TM gets released after the fill? How much do you think it's worth the extra complexity? FWIW, my v2 patch suffers from this issue and it hasn't reared itself in production. The write load for the machines in the cluster I've been looking at is comparatively low though compared to many others at 6-7k/sec peak on an 8-core box. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.13 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345-v3.txt, 6345-v4.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. 
> A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, s
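The defensive suggestion above (bump the version whenever the TokenMetadata write lock is released) is easy to picture with a small sketch; this is not the actual TokenMetadata code, just an illustration of the pattern:
{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged sketch of version bumping in a finally block so that any write-locked
// mutation invalidates downstream caches, even if the specific method forgets to.
public class VersionedRingState
{
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong version = new AtomicLong();

    public void mutate(Runnable ringChange)
    {
        lock.writeLock().lock();
        try
        {
            ringChange.run(); // apply whatever token/endpoint mutation is needed
        }
        finally
        {
            // Bumped unconditionally before the write lock is released, so readers
            // comparing versions can never observe new data under an old version.
            version.incrementAndGet();
            lock.writeLock().unlock();
        }
    }

    public long getVersion()
    {
        return version.get();
    }
}
{code}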
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825614#comment-13825614 ] Rick Branson commented on CASSANDRA-6345: - I like the simpler approach. I still think the callbacks for invalidation are asking for it ;) I also think perhaps the stampede lock should be more explicit than a synchronized lock on "this" to prevent unintended blocking from future modifications. Either way, I think the only material concern I have is the order that TokenMetadata changes get applied to the caches in AbstractReplicationStrategy instances. Shouldn't the invalidation take place on all threads in all instances of AbstractReplicationStrategy before returning from an endpoint-mutating write operation in TokenMetadata? It seems as if just setting the cache to empty would allow a period of time where TokenMetadata write methods had returned but not all threads have seen the mutation yet because they are still holding onto the old clone of TM. This might be alright though, I'm not sure. Thoughts? > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345-v3.txt, 6345.txt, half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. 
> Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (CASSANDRA-6354) No cleanup of excess gossip connections
Rick Branson created CASSANDRA-6354: --- Summary: No cleanup of excess gossip connections Key: CASSANDRA-6354 URL: https://issues.apache.org/jira/browse/CASSANDRA-6354 Project: Cassandra Issue Type: Bug Reporter: Rick Branson Priority: Minor While trying to cut off communication between two nodes, I noticed a production node had >300 established connections active to another node on the storage port. It looks like there's no check to keep these limited, so they'll just sit around forever. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Attachment: 6345-rbranson-v2.txt Attached a new patch with the deadlock fixed. We're running this on a production cluster. The primary issue was the callback for invalidation from TokenMetadata to all of the registered AbstractReplicationStrategy instances. This was asking for it anyway, so in the patch I replaced the "push" invalidation with simple versioning of the TokenMetadata endpoints. TokenMetadata bumps it's version number each time the cache would need to be invalidated, and AbstractReplicationStrategy checks it's version when it needs to do a read, invalidating if necessary. This gets the invalidation out of the gossip threads and into the RPC threads, which is probably a good thing. The only thing I'm not super crazy about is the extra hot path read lock acquisition on TokenMetadata.getEndpointVersion(), which might be avoidable. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson-v2.txt, 6345-rbranson.txt, 6345-v2.txt, > 6345.txt, half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. 
> Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
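On the read side, the pull-based invalidation described in the v2 patch comment amounts to comparing a cached version against the live one and re-cloning only on a mismatch. The sketch below is illustrative only; in particular, the liveVersion parameter stands in for whatever accessor the patch adds (the getEndpointVersion() mentioned above) and is not an existing Cassandra API.
{code}
import org.apache.cassandra.locator.TokenMetadata;

// Hedged sketch: re-clone the token map only when its version has moved, instead
// of invalidating eagerly from gossip-driven callbacks.
public class VersionedTokenMapCache
{
    private volatile long cachedVersion = -1;
    private volatile TokenMetadata cachedClone;

    public TokenMetadata get(TokenMetadata live, long liveVersion)
    {
        TokenMetadata clone = cachedClone;
        if (clone == null || cachedVersion != liveVersion)
        {
            synchronized (this)
            {
                if (cachedClone == null || cachedVersion != liveVersion)
                {
                    // Expensive on vnode rings, so only done when the ring actually
                    // changed rather than on every endpoint cache miss.
                    cachedClone = live.cloneOnlyTokenMap();
                    cachedVersion = liveVersion;
                }
                clone = cachedClone;
            }
        }
        return clone;
    }
}
{code}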
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823274#comment-13823274 ] Rick Branson commented on CASSANDRA-6345: - Unfortunately both of the patches suffer from a deadlock, since the invalidation and fill are wrapped up in TokenMetadata's locks. T1 acquires cache read lock T2 acquires TokenMetadata write lock T1 acquires cache write lock on miss T2 is blocked on cache write lock trying to invalidate T1 is blocked on TokenMetadata read lock trying to cloneOnlyTokenMap to fill the cache Trying to work on a fix. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson.txt, 6345-v2.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13822027#comment-13822027 ] Rick Branson commented on CASSANDRA-6345: - Well, I started writing the patch this morning and I don't write multi-threaded Java code every day, so I'm overly careful ;) The only theoretical advantage to my patch is that it allows concurrent readers. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson.txt, 6345-v2.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Attachment: half-way-thru-6345-rbranson-patch-applied.png CPU user% graph during the rollout of the patch I attached on 1 DC (15 nodes) of the cluster. Around ~21:05 the patch starts to roll out and spikes are seen. The node in question receives the patch at ~21:30, and afterwards the spikes are gone. The rollout finishes at ~21:45. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-v2.txt, 6345.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Attachment: 6345-rbranson.txt > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson.txt, 6345-v2.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Attachment: (was: 6345.txt) > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-rbranson.txt, 6345-v2.txt, 6345.txt, > half-way-thru-6345-rbranson-patch-applied.png > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Attachment: 6345.txt Attached a patch we deployed to production that fixed the issue. > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson >Assignee: Jonathan Ellis > Fix For: 1.2.12, 2.0.3 > > Attachments: 6345-v2.txt, 6345.txt, 6345.txt > > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
[ https://issues.apache.org/jira/browse/CASSANDRA-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-6345: Reproduced In: 1.2.11 > Endpoint cache invalidation causes CPU spike (on vnode rings?) > -- > > Key: CASSANDRA-6345 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 > Project: Cassandra > Issue Type: Bug > Environment: 30 nodes total, 2 DCs > Cassandra 1.2.11 > vnodes enabled (256 per node) >Reporter: Rick Branson > > We've observed that events which cause invalidation of the endpoint cache > (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy > result in several seconds of thundering herd behavior on the entire cluster. > A thread dump shows over a hundred threads (I stopped counting at that point) > with a backtrace like this: > at java.net.Inet4Address.getAddress(Inet4Address.java:288) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) > at > org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) > at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) > at java.util.TreeMap.getEntry(TreeMap.java:322) > at java.util.TreeMap.get(TreeMap.java:255) > at > com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) > at > com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) > at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) > at > com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) > at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) > at > org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) > at > org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) > at > org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) > at > org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) > at > org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) > It looks like there's a large amount of cost in the > TokenMetadata.cloneOnlyTokenMap that > AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is > a cache miss for an endpoint. It seems as if this would only impact clusters > with large numbers of tokens, so it's probably a vnodes-only issue. > Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the > cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), > wrapping it with a lock to prevent stampedes, and clearing it in > clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (CASSANDRA-6345) Endpoint cache invalidation causes CPU spike (on vnode rings?)
Rick Branson created CASSANDRA-6345: --- Summary: Endpoint cache invalidation causes CPU spike (on vnode rings?) Key: CASSANDRA-6345 URL: https://issues.apache.org/jira/browse/CASSANDRA-6345 Project: Cassandra Issue Type: Bug Environment: 30 nodes total, 2 DCs Cassandra 1.2.11 vnodes enabled (256 per node) Reporter: Rick Branson We've observed that events which cause invalidation of the endpoint cache (update keyspace, add/remove nodes, etc) in AbstractReplicationStrategy result in several seconds of thundering herd behavior on the entire cluster. A thread dump shows over a hundred threads (I stopped counting at that point) with a backtrace like this: at java.net.Inet4Address.getAddress(Inet4Address.java:288) at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:106) at org.apache.cassandra.locator.TokenMetadata$1.compare(TokenMetadata.java:103) at java.util.TreeMap.getEntryUsingComparator(TreeMap.java:351) at java.util.TreeMap.getEntry(TreeMap.java:322) at java.util.TreeMap.get(TreeMap.java:255) at com.google.common.collect.AbstractMultimap.put(AbstractMultimap.java:200) at com.google.common.collect.AbstractSetMultimap.put(AbstractSetMultimap.java:117) at com.google.common.collect.TreeMultimap.put(TreeMultimap.java:74) at com.google.common.collect.AbstractMultimap.putAll(AbstractMultimap.java:273) at com.google.common.collect.TreeMultimap.putAll(TreeMultimap.java:74) at org.apache.cassandra.utils.SortedBiMultiValMap.create(SortedBiMultiValMap.java:60) at org.apache.cassandra.locator.TokenMetadata.cloneOnlyTokenMap(TokenMetadata.java:598) at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:104) at org.apache.cassandra.service.StorageService.getNaturalEndpoints(StorageService.java:2671) at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:375) It looks like there's a large amount of cost in the TokenMetadata.cloneOnlyTokenMap that AbstractReplicationStrategy.getNaturalEndpoints is calling each time there is a cache miss for an endpoint. It seems as if this would only impact clusters with large numbers of tokens, so it's probably a vnodes-only issue. Proposal: In AbstractReplicationStrategy.getNaturalEndpoints(), cache the cloned TokenMetadata instance returned by TokenMetadata.cloneOnlyTokenMap(), wrapping it with a lock to prevent stampedes, and clearing it in clearEndpointCache(). Thoughts? -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Comment Edited] (CASSANDRA-6335) Hints broken for nodes that change broadcast address
[ https://issues.apache.org/jira/browse/CASSANDRA-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820415#comment-13820415 ] Rick Branson edited comment on CASSANDRA-6335 at 11/12/13 8:41 PM: --- It looks like it stalls the hint delivery thread pool indefinitely. Restarting the node frees it up and it starts delivering hints again. I'm rolling this cluster to work around the issue. EDIT: disregard the "stalling" issue... there were just a large # of hints that needed to be delivered to a specific node (because it was down for a bit for some maintenance) clogging up the pool and confusing the issue. was (Author: rbranson): It looks like it stalls the hint delivery thread pool indefinitely. Restarting the node frees it up and it starts delivering hints again. I'm rolling this cluster to work around the issue. > Hints broken for nodes that change broadcast address > > > Key: CASSANDRA-6335 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6335 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson > > When a node changes its broadcast address, the transition process works > properly, but hints that are destined for it can't be delivered because of > the address change. It produces an exception: > java.lang.AssertionError: Missing host ID for 10.1.60.22 > at > org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:598) > at > org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:567) > at > org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1679) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (CASSANDRA-6335) Hints broken for nodes that change broadcast address
[ https://issues.apache.org/jira/browse/CASSANDRA-6335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820415#comment-13820415 ] Rick Branson commented on CASSANDRA-6335: - It looks like it stalls the hint delivery thread pool indefinitely. Restarting the node frees it up and it starts delivering hints again. I'm rolling this cluster to work around the issue. > Hints broken for nodes that change broadcast address > > > Key: CASSANDRA-6335 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6335 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Rick Branson > > When a node changes its broadcast address, the transition process works > properly, but hints that are destined for it can't be delivered because of > the address change. It produces an exception: > java.lang.AssertionError: Missing host ID for 10.1.60.22 > at > org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:598) > at > org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:567) > at > org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1679) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (CASSANDRA-6335) Hints broken for nodes that change broadcast address
Rick Branson created CASSANDRA-6335: --- Summary: Hints broken for nodes that change broadcast address Key: CASSANDRA-6335 URL: https://issues.apache.org/jira/browse/CASSANDRA-6335 Project: Cassandra Issue Type: Bug Components: Core Reporter: Rick Branson When a node changes its broadcast address, the transition process works properly, but hints that are destined for it can't be delivered because of the address change. It produces an exception: java.lang.AssertionError: Missing host ID for 10.1.60.22 at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:598) at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:567) at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:1679) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) -- This message was sent by Atlassian JIRA (v6.1#6144)
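The assertion in the trace above fires because hint delivery still resolves the target's host ID by its old broadcast address. The toy sketch below models only that mapping problem and fails soft (log and skip) instead of asserting; the types, the second IP address in main, and the soft-failure behaviour are assumptions for illustration, not Cassandra's actual StorageProxy/TokenMetadata code.

{code}
import java.net.InetAddress;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the address -> host ID mapping that hint delivery depends on.
// It shows the failure mode (a hint queued against the old address can no
// longer resolve a host ID) and fails soft instead of tripping an assert.
public class HintTargetResolver
{
    private final Map<InetAddress, UUID> hostIdByAddress = new ConcurrentHashMap<>();

    public void updateAddress(UUID hostId, InetAddress newAddress)
    {
        hostIdByAddress.values().removeIf(id -> id.equals(hostId));   // drop the stale entry for this host
        hostIdByAddress.put(newAddress, hostId);
    }

    /** Returns the host ID to hint for, or null (with a log line) if the address is unknown. */
    public UUID hostIdFor(InetAddress target)
    {
        UUID hostId = hostIdByAddress.get(target);
        if (hostId == null)
            System.err.printf("No host ID for %s; skipping hint instead of asserting%n", target);
        return hostId;
    }

    public static void main(String[] args) throws Exception
    {
        HintTargetResolver resolver = new HintTargetResolver();
        UUID host = UUID.randomUUID();
        resolver.updateAddress(host, InetAddress.getByName("10.1.60.22"));

        // the node comes back with a new broadcast address (example value)
        resolver.updateAddress(host, InetAddress.getByName("10.1.61.5"));

        System.out.println(resolver.hostIdFor(InetAddress.getByName("10.1.60.22")));   // null: old address
        System.out.println(resolver.hostIdFor(InetAddress.getByName("10.1.61.5")));    // the host's ID
    }
}
{code}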
[jira] [Created] (CASSANDRA-6334) Option to not listen on Thrift/CQL if cluster is
Rick Branson created CASSANDRA-6334: --- Summary: Option to not listen on Thrift/CQL if cluster is Key: CASSANDRA-6334 URL: https://issues.apache.org/jira/browse/CASSANDRA-6334 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Rick Branson Priority: Minor Sometimes nodes get isolated because of bugs or misconfiguration on either the server or the client side. A useful configuration option found in other systems is to not listen for client requests until the cluster reaches a minimum number of nodes. This would prevent "perfectly valid" reads & writes from taking place on these isolated nodes. What do we think about this? -- This message was sent by Atlassian JIRA (v6.1#6144)
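As a sketch of what the suggested option could look like, the check below gates client-transport startup on a minimum live-node count. The option name and the way the live count would be obtained are assumptions for illustration only; no such setting exists in cassandra.yaml as of this ticket.

{code}
// Hypothetical gate in front of the client-facing (Thrift/CQL) listeners.
// The option name and the source of the live-node count are assumptions for
// illustration; no such setting exists in cassandra.yaml as of this ticket.
public final class ClientListenerGate
{
    private ClientListenerGate() {}

    /**
     * @param minLiveNodes value of a hypothetical "min_live_nodes_for_client_requests" option (0 = disabled)
     * @param liveNodes    number of live members currently visible via gossip
     * @return true if it is safe to start accepting client requests
     */
    public static boolean mayStartClientTransports(int minLiveNodes, int liveNodes)
    {
        if (minLiveNodes <= 0)
            return true;                       // feature off: keep today's behaviour
        return liveNodes >= minLiveNodes;      // isolated/partitioned node stays closed to clients
    }

    public static void main(String[] args)
    {
        // e.g. a node that only sees itself in a 30-node cluster configured with a minimum of 15
        System.out.println(mayStartClientTransports(15, 1));    // false -> don't bind Thrift/CQL yet
        System.out.println(mayStartClientTransports(15, 22));   // true  -> start the listeners
    }
}
{code}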
[jira] [Commented] (CASSANDRA-6106) QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() / 1000
[ https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779586#comment-13779586 ] Rick Branson commented on CASSANDRA-6106: - Yeah you're right. The thread will probably occasionally get interrupted by the kernel between the currentTimeMillis() and the nanoTime() call. It'd probably take a bunch of iterations to properly "calibrate" the base clock. This is getting pretty far into the extremely-fragile territory. > QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current > timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() > / 1000 > > > Key: CASSANDRA-6106 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6106 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: DSE Cassandra 3.1, but also HEAD >Reporter: Christopher Smith >Priority: Minor > Labels: collision, conflict, timestamp > Attachments: microtimstamp.patch > > > I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra > mentioned issues with millisecond rounding in timestamps and was able to > reproduce the issue. If I specify a timestamp in a mutating query, I get > microsecond precision, but if I don't, I get timestamps rounded to the > nearest millisecond, at least for my first query on a given connection, which > substantially increases the possibilities of collision. > I believe I found the offending code, though I am by no means sure this is > comprehensive. I think we probably need a fairly comprehensive replacement of > all uses of System.currentTimeMillis() with System.nanoTime(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-6106) QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() / 1000
[ https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779573#comment-13779573 ] Rick Branson commented on CASSANDRA-6106: - I think you'd actually have to use the discrepancy between currentTimeMillis() now-start delta vs nanoTime() now-start delta to detect wall-clock drift/change and correct the nanoTime(). > QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current > timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() > / 1000 > > > Key: CASSANDRA-6106 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6106 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: DSE Cassandra 3.1, but also HEAD >Reporter: Christopher Smith >Priority: Minor > Labels: collision, conflict, timestamp > Attachments: microtimstamp.patch > > > I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra > mentioned issues with millisecond rounding in timestamps and was able to > reproduce the issue. If I specify a timestamp in a mutating query, I get > microsecond precision, but if I don't, I get timestamps rounded to the > nearest millisecond, at least for my first query on a given connection, which > substantially increases the possibilities of collision. > I believe I found the offending code, though I am by no means sure this is > comprehensive. I think we probably need a fairly comprehensive replacement of > all uses of System.currentTimeMillis() with System.nanoTime(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-6106) QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() / 1000
[ https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779564#comment-13779564 ] Rick Branson commented on CASSANDRA-6106: - Could this re-sample the base wallclock at an interval so that it doesn't drift too far? (say, every 1s or 100ms) > QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current > timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() > / 1000 > > > Key: CASSANDRA-6106 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6106 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: DSE Cassandra 3.1, but also HEAD >Reporter: Christopher Smith >Priority: Minor > Labels: collision, conflict, timestamp > Attachments: microtimstamp.patch > > > I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra > mentioned issues with millisecond rounding in timestamps and was able to > reproduce the issue. If I specify a timestamp in a mutating query, I get > microsecond precision, but if I don't, I get timestamps rounded to the > nearest millisecond, at least for my first query on a given connection, which > substantially increases the possibilities of collision. > I believe I found the offending code, though I am by no means sure this is > comprehensive. I think we probably need a fairly comprehensive replacement of > all uses of System.currentTimeMillis() with System.nanoTime(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
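Pulling the last few comments together, the idea under discussion is a hybrid clock: anchor System.currentTimeMillis() once, advance it with System.nanoTime(), and periodically re-sample the wall clock so the anchor cannot drift far. The sketch below is illustrative only; it is not the attached microtimstamp.patch, and it deliberately glosses over the calibration and clock-step concerns raised above.

{code}
// Illustrative hybrid microsecond clock: anchor the wall clock once, advance it
// with nanoTime(), and re-anchor roughly once a second so NTP adjustments can't
// let it drift far. Synchronized for simplicity; a real implementation would
// need a lock-free design and careful handling of the wall clock stepping back.
public final class MicrosecondClock
{
    private static final long RESAMPLE_EVERY_MICROS = 1_000_000;   // re-anchor about once per second

    private long baseMicros;     // wall-clock anchor, microseconds since the epoch
    private long baseNanos;      // nanoTime() captured at the anchor
    private long lastReturned;   // keeps results monotonic across re-anchors

    public MicrosecondClock()
    {
        resample();
    }

    private void resample()
    {
        baseMicros = System.currentTimeMillis() * 1000;
        baseNanos = System.nanoTime();
    }

    public synchronized long timestampMicros()
    {
        long elapsed = (System.nanoTime() - baseNanos) / 1000;
        if (elapsed > RESAMPLE_EVERY_MICROS)
        {
            resample();   // pull the anchor back toward the wall clock
            elapsed = (System.nanoTime() - baseNanos) / 1000;
        }
        lastReturned = Math.max(lastReturned, baseMicros + elapsed);   // never go backwards after a re-anchor
        return lastReturned;
    }
}
{code}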
[jira] [Commented] (CASSANDRA-5948) StreamOut doesn't correctly handle wrapped ranges
[ https://issues.apache.org/jira/browse/CASSANDRA-5948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13758245#comment-13758245 ] Rick Branson commented on CASSANDRA-5948: - Curious what the impact of this would have been? Has streaming effectively been broken since 1.2.6, since every cluster has wrapping ranges? > StreamOut doesn't correctly handle wrapped ranges > - > > Key: CASSANDRA-5948 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5948 > Project: Cassandra > Issue Type: Bug >Reporter: Sergio Bossa > Fix For: 1.2.10, 2.0.1 > > Attachments: 5948-0001.patch > > > StreamOut doesn't normalize ranges, causing AbstractViewSSTableFinder to miss > sstables when the requested range is wrapped, and hence breaking node > bootstrapping/unbootstrapping on such ranges. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
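For context on what "normalize" means here: a range whose left token is greater than or equal to its right token wraps around the ring and has to be split into two ordinary ranges before sstables are selected for streaming. The snippet below is a simplified, self-contained illustration of that split; the token type and MIN/MAX bounds are stand-ins, and Cassandra's real Range additionally special-cases the minimum token and full-ring ranges.

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified picture of why wrapped ranges need normalizing before sstables are
// chosen for streaming: a range whose left token is >= its right token wraps
// around the ring and must be split into two ordinary ranges.
public final class RangeUnwrapExample
{
    static final long MIN_TOKEN = Long.MIN_VALUE;
    static final long MAX_TOKEN = Long.MAX_VALUE;

    record Range(long left, long right) {}   // (left, right], as in Cassandra

    static boolean isWrapAround(Range r)
    {
        return r.left() >= r.right();
    }

    static List<Range> unwrap(Range r)
    {
        List<Range> out = new ArrayList<>();
        if (isWrapAround(r))
        {
            out.add(new Range(r.left(), MAX_TOKEN));    // top of the ring
            out.add(new Range(MIN_TOKEN, r.right()));   // bottom of the ring
        }
        else
        {
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args)
    {
        System.out.println(unwrap(new Range(9_000, -9_000)));   // wrapping: split into two pieces
        System.out.println(unwrap(new Range(-100, 100)));       // ordinary: passed through untouched
    }
}
{code}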
[jira] [Updated] (CASSANDRA-5966) Average name query performance much worse for wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-5966: Attachment: 5966-trunk.txt Patch for trunk. > Average name query performance much worse for wide rows > --- > > Key: CASSANDRA-5966 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5966 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Rick Branson > Fix For: 1.2.10, 2.0.1 > > Attachments: 5966-trunk.txt, 5966.txt, 5966-v2.txt > > > The average performance for by-name queries degrades heavily on wide rows. > This is because in the wide row case SSTableNamesIterator deserializes every > column in the entire row chunk (64KB by default), whereas in the narrow row > case, it stops deserializing as soon as it's found the columns it's looking > for. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
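The fix the description points at is essentially an early exit: because columns in a block are sorted, a by-name read can stop deserializing once it has seen every requested name, or once it has read past the last one. The toy model below illustrates the difference using plain sorted collections rather than SSTableNamesIterator; it is not the attached patch.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of a by-name read over a sorted block of columns. readAll walks the
// whole block (what the wide-row path effectively did); readWithEarlyExit stops
// as soon as every requested name has been seen or the last name has been passed.
public final class NameQueryExample
{
    static List<String> readAll(SortedMap<String, String> block, SortedSet<String> names)
    {
        List<String> result = new ArrayList<>();
        for (var entry : block.entrySet())             // "deserializes" every column in the block
            if (names.contains(entry.getKey()))
                result.add(entry.getValue());
        return result;
    }

    static List<String> readWithEarlyExit(SortedMap<String, String> block, SortedSet<String> names)
    {
        List<String> result = new ArrayList<>();
        String last = names.last();                    // columns are sorted, so nothing past this matters
        for (var entry : block.entrySet())
        {
            if (entry.getKey().compareTo(last) > 0)
                break;                                 // read past the last requested name: stop
            if (names.contains(entry.getKey()))
            {
                result.add(entry.getValue());
                if (result.size() == names.size())
                    break;                             // found everything that was asked for
            }
        }
        return result;
    }

    public static void main(String[] args)
    {
        SortedMap<String, String> block = new TreeMap<>();
        for (int i = 0; i < 100_000; i++)
            block.put(String.format("col%06d", i), "v" + i);
        SortedSet<String> names = new TreeSet<>(List.of("col000002", "col000010"));
        System.out.println(readAll(block, names));             // scans all 100,000 columns
        System.out.println(readWithEarlyExit(block, names));   // stops after 11 columns
    }
}
{code}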
[jira] [Commented] (CASSANDRA-5966) Average name query performance much worse for wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755634#comment-13755634 ] Rick Branson commented on CASSANDRA-5966: - v2 patch fixes the wide-row case where one of the query columns is an index entry's end range due to SortedSet.subSet() being end-exclusive. > Average name query performance much worse for wide rows > --- > > Key: CASSANDRA-5966 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5966 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Rick Branson > Attachments: 5966.txt, 5966-v2.txt > > > The average performance for by-name queries degrades heavily on wide rows. > This is because in the wide row case SSTableNamesIterator deserializes every > column in the entire row chunk (64KB by default), whereas in the narrow row > case, it stops deserializing as soon as it's found the columns it's looking > for. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
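The boundary case mentioned in the comment above is a standard Java pitfall: SortedSet.subSet(from, to) excludes its upper bound, so a query column that happens to equal an index entry's end bound silently drops out of the subset, whereas NavigableSet.subSet offers an inclusive form. A minimal demonstration of the difference (not the patch itself):

{code}
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// SortedSet.subSet(from, to) is end-exclusive, so a query column equal to an
// index entry's end bound silently disappears from the subset; NavigableSet
// offers an inclusive form. Minimal demonstration, not the actual patch.
public final class SubSetBoundaryExample
{
    public static void main(String[] args)
    {
        NavigableSet<String> queryColumns = new TreeSet<>(List.of("a", "m", "z"));

        String blockStart = "a";
        String blockEnd = "z";   // suppose "z" is this index entry's end bound

        System.out.println(queryColumns.subSet(blockStart, blockEnd));               // [a, m] -- "z" is lost
        System.out.println(queryColumns.subSet(blockStart, true, blockEnd, true));   // [a, m, z]
    }
}
{code}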
[jira] [Updated] (CASSANDRA-5966) Average name query performance much worse for wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rick Branson updated CASSANDRA-5966: Attachment: 5966-v2.txt > Average name query performance much worse for wide rows > --- > > Key: CASSANDRA-5966 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5966 > Project: Cassandra > Issue Type: Bug >Reporter: Rick Branson >Assignee: Rick Branson > Attachments: 5966.txt, 5966-v2.txt > > > The average performance for by-name queries degrades heavily on wide rows. > This is because in the wide row case SSTableNamesIterator deserializes every > column in the entire row chunk (64KB by default), whereas in the narrow row > case, it stops deserializing as soon as it's found the columns it's looking > for. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira