[jira] [Commented] (CASSANDRA-8430) Updating a row that has a TTL produce unexpected results
[ https://issues.apache.org/jira/browse/CASSANDRA-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240804#comment-14240804 ] Sylvain Lebresne commented on CASSANDRA-8430:
----------------------------------------------

bq. Now at first the row has 'abc', 2, 'whatever', then after the update it has 'abc', 0, 'whatever'.

If that were the case, it would be a bug: setting foo to {{null}} is not equivalent to setting it to {{0}}. But if you, say, use the Java driver and call {{getInt()}} to fetch the value of {{foo}}, that method will return {{0}} for a {{null}} value because it returns unboxed values (there is an {{isNull()}} method to check whether the value is actually {{null}}). So make sure this is not just what you're running into.

bq. It seems there's a difference between insert and update

There is one. An insert sets the primary key columns in their own right, while an update doesn't, which means that after an insert a row will continue to exist even if you remove all non-PK columns, while after an update it won't. Which is exactly what you're observing.

Now, none of these are Cassandra bugs, so if you have further questions about behavior, would you mind moving the conversation to the user mailing list (if only because the answers might benefit more people there)?

Updating a row that has a TTL produce unexpected results

Key: CASSANDRA-8430
URL: https://issues.apache.org/jira/browse/CASSANDRA-8430
Project: Cassandra
Issue Type: Bug
Reporter: Alan Boudreault
Labels: cassandra, ttl
Fix For: 2.0.11, 2.1.2, 3.0
Attachments: test.sh

Reported on stackoverflow: http://stackoverflow.com/questions/27280407/cassandra-ttl-gets-set-to-0-on-primary-key-if-no-ttl-is-specified-on-an-update?newreg=19e8c6757c62474985fef7c3037e8c08

I can reproduce the issue with 2.0, 2.1 and trunk.
I've attached a small script to reproduce the issue with CCM, and here is its output:
{code}
aboudreault@kovarro:~/dev/cstar/so27280407$ ./test.sh
Current cluster is now: local
Insert data with a 5 sec TTL
INSERT INTO ks.tbl (pk, foo, bar) values (1, 1, 'test') using TTL 5;

 pk | bar  | foo
----+------+-----
  1 | test |   1

(1 rows)
Update data with no TTL
UPDATE ks.tbl set bar='change' where pk=1;
sleep 6 sec
BUG: Row should be deleted now, but isn't. and foo column has been deleted???

 pk | bar    | foo
----+--------+-----
  1 | change | null

(1 rows)
Insert data with a 5 sec TTL
INSERT INTO ks.tbl (pk, foo, bar) values (1, 1, 'test') using TTL 5;

 pk | bar  | foo
----+------+-----
  1 | test |   1

(1 rows)
Update data with a higher (10) TTL
UPDATE ks.tbl USING TTL 10 set bar='change' where pk=1;
sleep 6 sec
BUG: foo column has been deleted?

 pk | bar    | foo
----+--------+-----
  1 | change | null

(1 rows)
sleep 5 sec
Data is deleted now after the second TTL set during the update. Is this a bug or the expected behavior?

(0 rows)
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
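Sylvain's point about {{getInt()}} can be seen without a cluster: Java unboxing maps a {{null}} {{Integer}} to {{0}}. A minimal sketch of the pitfall, using a plain {{HashMap}} as a hypothetical stand-in for the driver's {{Row}} (the class and method names below mimic, but are not, the driver API):

```java
import java.util.HashMap;
import java.util.Map;

public class NullVsZero {
    // A map stands in for a result Row; column "foo" was removed, so its value is null.
    static final Map<String, Integer> ROW = new HashMap<>();
    static { ROW.put("foo", null); }

    // Mimics Row.getInt(): returns an unboxed int, so a null column collapses to 0.
    static int getInt(String col) {
        Integer v = ROW.get(col);
        return v == null ? 0 : v;
    }

    // Mimics Row.isNull(): the reliable way to tell a null column apart from 0.
    static boolean isNull(String col) {
        return ROW.get(col) == null;
    }

    public static void main(String[] args) {
        System.out.println(getInt("foo"));  // 0, even though the stored value is null
        System.out.println(isNull("foo"));  // true
    }
}
```

So a reader of {{getInt('foo') == 0}} cannot conclude the column holds {{0}}; the {{isNull()}} check must come first.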
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240844#comment-14240844 ] Alexander Radzin commented on CASSANDRA-8390:
----------------------------------------------

I have the same issue with Windows 8. Here is the DiskAccessMode line that I found in Cassandra's system.log:
{noformat}
INFO [main] 2014-12-09 16:07:25,985 DatabaseDescriptor.java:203 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
{noformat}
I have several lines like this.

Here are the conditions under which the problem reproduces. Our application creates a new keyspace every day; each keyspace contains about 60 tables. The issue happened in production relatively seldom, but it happens in the testing environment all the time because each test case creates the keyspace again. I guess the problem is not specifically in creating keyspaces and tables, because it sometimes also happens when running {{truncate}}. The Cassandra DB is running with default settings. The client code looks like the following:
{noformat}
Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
Session session = cluster.connect();
String year = "2013";
for (int i = 1; i <= 12; i++) {
    String yearMonth = year + i;
    for (String template : cql.split("\\n")) {
        String query = String.format(template, yearMonth);
        System.out.println(query);
        session.execute(query);
    }
}
{noformat}
Where {{cql}} contains {{create keyspace}} and a lot of {{create table}} statements.
An interesting fact is that the problem _does not appear_ when using asynchronous calls:
{noformat}
Collection<ResultSetFuture> futures = new ArrayList<>();
Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
Session session = cluster.connect();
String year = "2013";
for (int i = 1; i <= 1200; i++) {
    String yearMonth = year + i;
    for (String template : cql.split("\\n")) {
        String query = String.format(template, yearMonth);
        System.out.println(query);
        ResultSetFuture future = session.executeAsync(query);
        futures.add(future);
    }
}
Futures.successfulAsList(futures);
{noformat}
Although this can be a temporary workaround that I will try to use, the problem itself is IMHO extremely critical. Full source code can be found [here|https://gist.github.com/alexradzin/9223fc16e95318e017ec].

The process cannot access the file because it is being used by another process

Key: CASSANDRA-8390
URL: https://issues.apache.org/jira/browse/CASSANDRA-8390
Project: Cassandra
Issue Type: Bug
Reporter: Ilya Komolkin
Assignee: Joshua McKenzie
Fix For: 2.1.3

21:46:27.810 [NonPeriodicTasks:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[NonPeriodicTasks:1,5,main] org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process.
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:135) ~[cassandra-all-2.1.1.jar:2.1.1]
    at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:121) ~[cassandra-all-2.1.1.jar:2.1.1]
    at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:113) ~[cassandra-all-2.1.1.jar:2.1.1]
    at org.apache.cassandra.io.sstable.SSTableDeletingTask.run(SSTableDeletingTask.java:94) ~[cassandra-all-2.1.1.jar:2.1.1]
    at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:664) ~[cassandra-all-2.1.1.jar:2.1.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_71]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_71]
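One caveat about the asynchronous variant above: {{Futures.successfulAsList(futures)}} only builds an aggregate future, and the snippet never blocks on it, so the client does not actually wait for the statements to finish; that alone may explain why the failure seems to disappear. A collect-then-block sketch using only the JDK, with {{CompletableFuture}} tasks as hypothetical stand-ins for {{session.executeAsync(query)}} (not the driver API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AwaitAll {
    public static List<String> runAll() {
        // Stand-ins for session.executeAsync(query): one async task per month.
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (int i = 1; i <= 12; i++) {
            final String yearMonth = "2013" + i;
            futures.add(CompletableFuture.supplyAsync(() -> "created " + yearMonth));
        }
        // The step the snippet above skips: block until every future settles,
        // the equivalent of calling get() on Futures.successfulAsList(futures).
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        List<String> results = new ArrayList<>();
        for (CompletableFuture<String> f : futures) results.add(f.join());
        return results;
    }

    public static void main(String[] args) {
        System.out.println(runAll().size()); // 12
    }
}
```

With the real Guava API the equivalent would presumably be {{Futures.successfulAsList(futures).get()}}, which blocks until all queries have completed or failed.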
[jira] [Comment Edited] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240844#comment-14240844 ] Alexander Radzin edited comment on CASSANDRA-8390 at 12/10/14 9:26 AM:
------------------------------------------------------------------------

I have the same issue with Windows 8. Here is the DiskAccessMode line that I found in Cassandra's system.log:
{noformat}
INFO [main] 2014-12-09 16:07:25,985 DatabaseDescriptor.java:203 - DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
{noformat}
I have several lines like this.

Important: when this happens, the client gets {{NoHostAvailableException}} and stops working, which requires a restart of Cassandra.
{noformat}
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
    at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:259)
    at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:175)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
    at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)
    at com.clarisite.clingine.dataaccesslayer.cassandra.CQLTest1.cqlSync(CQLTest1.java:56)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:121)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
    at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:102)
    at com.datastax.driver.core.SessionManager.execute(SessionManager.java:461)
    at com.datastax.driver.core.SessionManager.executeQuery(SessionManager.java:497)
    at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:87)
    ... 34 more
{noformat}

Here are the conditions under which the problem reproduces. Our application creates a new keyspace every day; each keyspace contains about 60 tables. The issue happened in production relatively seldom, but it happens in the testing environment all the time because each test case creates the keyspace again. I guess the problem is not specifically in creating keyspaces and tables, because it sometimes also happens when running {{truncate}}. The Cassandra DB is running with default settings. The client code looks like the following:
{noformat}
Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
Session session =
[jira] [Updated] (CASSANDRA-7947) Change error message when RR times out
[ https://issues.apache.org/jira/browse/CASSANDRA-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sam Tunnicliffe updated CASSANDRA-7947:
---------------------------------------
Attachment: 7947.txt

Attaching a simple patch which correctly sets the CL on the thrown exception. It looks like this is only an issue when the read requests following a digest mismatch time out; if we time out waiting on acks for the subsequent mutations, the correct CL is communicated.

Change error message when RR times out

Key: CASSANDRA-7947
URL: https://issues.apache.org/jira/browse/CASSANDRA-7947
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Brandon Williams
Assignee: Sam Tunnicliffe
Priority: Minor
Fix For: 2.0.12
Attachments: 7947.txt

When a quorum request detects a digest mismatch, it then reads the data to repair the mismatch by issuing a request at CL.ALL to the same endpoints (SP.fetchRows). If this request in turn times out, the client receives a TimedOutException with a misleading message that mentions CL.ALL, possibly causing them to think the request has gone cross-DC when it has not; it was just slow enough to time out.
[jira] [Comment Edited] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240844#comment-14240844 ] Alexander Radzin edited comment on CASSANDRA-8390 at 12/10/14 10:29 AM:
[jira] [Created] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
Jacques-Henri Berthemet created CASSANDRA-8453:
-----------------------------------------------

Summary: Ability to override TTL on different data-centers, plus one-way replication
Key: CASSANDRA-8453
URL: https://issues.apache.org/jira/browse/CASSANDRA-8453
Project: Cassandra
Issue Type: Wish
Components: Core
Reporter: Jacques-Henri Berthemet

Here is my scenario: I want to have one datacenter specialized for operations (DCO) and another for historical/audit data (DCH). Replication will be used between DCO and DCH. When the TTL expires on DCO and data is deleted, I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH.

I guess this also implies that replication should be done only in the DCO -> DCH direction, so that data is not re-created. But that's secondary; DCH data is not meant to be modified.

Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers on modifying Cassandra in order to achieve this functionality? Thank you.
[jira] [Updated] (CASSANDRA-7947) Change error message when RR times out
[ https://issues.apache.org/jira/browse/CASSANDRA-7947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-7947:
-----------------------------------------
Reviewer: Aleksey Yeschenko
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240973#comment-14240973 ] Aleksey Yeschenko commented on CASSANDRA-8453:
----------------------------------------------

Not feasible, sorry; it goes against core Cassandra principles. You could create two separate keyspaces for this data and write to both: with a TTL to one of them, without a TTL to the other. Maybe have a replication factor of 0 in one of the DCs for each keyspace. That's as close as you are going to get, I'm afraid.
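The two-keyspace workaround could look roughly like this in CQL. This is a sketch only: the keyspace, table, and column names are hypothetical, the DC names come from the report, and instead of an explicit replication factor of 0 each keyspace simply omits the other DC from its replication map.

```sql
-- Operational data lives only in DCO and expires via TTL (names hypothetical):
CREATE KEYSPACE ops
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DCO': 3};

-- Audit data lives only in DCH and is never expired:
CREATE KEYSPACE audit
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DCH': 3};

-- The client then writes every mutation twice, once per keyspace:
INSERT INTO ops.events (pk, data) VALUES (1, 'x') USING TTL 86400;
INSERT INTO audit.events (pk, data) VALUES (1, 'x');
```

The cost is the double write on the client side, but each DC then owns its data independently, which is what the one-way-replication wish amounts to.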
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240975#comment-14240975 ] Benedict commented on CASSANDRA-8449:
-------------------------------------

Unless we explicitly force all queries to yield a timeout response even if they have successfully terminated after the timeout, and we enforce this constraint _after_ copying the data to the output buffers (netty and thrift), this is guaranteed to return junk data to a user somewhere, sometime. So I am -1 on this approach.

Allow zero-copy reads again

Key: CASSANDRA-8449
URL: https://issues.apache.org/jira/browse/CASSANDRA-8449
Project: Cassandra
Issue Type: Improvement
Reporter: T Jake Luciani
Assignee: T Jake Luciani
Priority: Minor
Labels: performance
Fix For: 3.0

We disabled zero-copy reads in CASSANDRA-3179 due to in-flight reads accessing a ByteBuffer when the data was unmapped by compaction. Currently this code path is only used for uncompressed reads. The actual bytes are in fact copied to the client output buffers for both netty and thrift before being sent over the wire, so the only issue really is the time it takes to process the read internally. This patch adds a slow network read test and changes the tidy() method to actually delete an sstable only once the readTimeout has elapsed, giving plenty of time to serialize the read. Removing this copy causes significantly less GC on the read path and improves the tail latencies: http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688f&metric=gc_count&operation=2_read&smoothing=1&show_aggregates=true&xmin=0&xmax=109.34&ymin=0&ymax=5.5
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240995#comment-14240995 ] Aleksey Yeschenko commented on CASSANDRA-8449:
----------------------------------------------

We will, but only once we have CASSANDRA-7392.
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241004#comment-14241004 ] Benedict commented on CASSANDRA-8449:
-------------------------------------

That depends on how it is implemented. I will go out on a limb and predict it will offer no such guarantee, as there will always be a potential race condition (easily triggered by e.g. lengthy GC pauses) without enforcing the constraint _after_ performing the copy to the transport buffers, which is a very specific condition that I don't think is being considered for CASSANDRA-7392.
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241005#comment-14241005 ] Aleksey Yeschenko commented on CASSANDRA-8449:
----------------------------------------------

Fair enough.
[jira] [Updated] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jonathan lacefield updated CASSANDRA-8447:
------------------------------------------
Description:

Behavior - If autocompaction is enabled, nodes become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.

Test methodology - Disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write-only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured a heap dump from the 50-thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up.

Data load test through Cassandra stress -
/usr/bin/cassandra-stress write n=19 -rate threads=<threads tested> -schema replication\(factor=3\) keyspace=Keyspace1 -node <all nodes listed>

Data load thread count and results:
* 1 thread - still running, but it looks like the node can sustain this load (approx 500 writes per second per node)
* 5 threads - nodes become unresponsive due to full Old Gen heap; CMS measured in the 60 second range (approx 2k writes per second per node)
* 10 threads - nodes become unresponsive due to full Old Gen heap; CMS measured in the 60 second range
* 50 threads - nodes become unresponsive due to full Old Gen heap; CMS measured in the 60 second range (approx 10k writes per second per node)
* 100 threads - nodes become unresponsive due to full Old Gen heap; CMS measured in the 60 second range (approx 20k writes per second per node)
* 200 threads - nodes become unresponsive due to full Old Gen heap; CMS measured in the 60 second range (approx 25k writes per second per node)

Note - the observed behavior was the same for all tests except the single-threaded one, which does not appear to show this behavior.

Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads.
JVM settings tested:
# default, out of the box, env.sh settings
# 10 G Max | 1 G New - default env.sh settings
# 10 G Max | 1 G New - default env.sh settings
#* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
# 20 G Max | 10 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
# 20 G Max | 1 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"

Linux OS settings tested:
# Disabled Transparent Huge Pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Enabled Huge Pages
echo 215 > /proc/sys/kernel/shmmax (over 20GB for heap)
echo 1536 > /proc/sys/vm/nr_hugepages (20GB/2MB page size)
# Disabled NUMA
numa-off in /etc/grub.conf
# Verified all settings documented here were implemented: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

Attachments:
# .yaml
# fio output - results.tar.gz
# 50 thread heap dump - will update new heap dump soon
# 100 thread - visual vm anonymous screenshot - visualvm_screenshot
# dstat screen shot with compaction - Node_with_compaction.png
# dstat screen shot without compaction - Node_without_compaction.png
# gcinspector messages from system.log
# gc.log output
[jira] [Commented] (CASSANDRA-8429) Stress on trunk fails mixed workload on missing keys
[ https://issues.apache.org/jira/browse/CASSANDRA-8429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241076#comment-14241076 ] Marcus Eriksson commented on CASSANDRA-8429: Branch for this is here: https://github.com/krummas/cassandra/commit/b23b7b3c2e5a800fefd86b0427dcffe3d1c7efb1 The approach is to close the finished files (but keep them as .tmp), make tmplinks from those closed files, and open sstablereaders over the tmplink files. Then, when we actually finish, we rename the tmp files to final files and open readers over those. Stress on trunk fails mixed workload on missing keys Key: CASSANDRA-8429 URL: https://issues.apache.org/jira/browse/CASSANDRA-8429 Project: Cassandra Issue Type: Bug Environment: Ubuntu 14.04 Reporter: Ariel Weisberg Assignee: Marcus Eriksson Attachments: cluster.conf, run_stress.sh Starts as part of merge commit 25be46497a8df46f05ffa102bc645bfd684ea48a Stress will say that a key wasn't validated because it isn't returned even though it's loaded. The key will eventually appear and can be queried using cqlsh. Reproduce with:
#!/bin/sh
ROWCOUNT=1000
SCHEMA='-col n=fixed(1) -schema compaction(strategy=LeveledCompactionStrategy) compression=LZ4Compressor'
./cassandra-stress write n=$ROWCOUNT -node xh61 -pop seq=1..$ROWCOUNT no-wrap -rate threads=25 $SCHEMA
./cassandra-stress mixed ratio(read=2) n=1 -node xh61 -pop dist=extreme(1..$ROWCOUNT,0.6) -rate threads=25 $SCHEMA
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
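The tmp -> tmplink -> final flow Marcus describes can be sketched with plain java.nio; the class, method names, and file-naming convention below are hypothetical illustrations, not Cassandra's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the scheme: a finished-but-not-final sstable is kept as a tmp
// file, a hard link with a tmplink name is created so early-opened readers
// keep the data alive, and on finish the tmp file is renamed to its final name.
public final class TmpLinkFlow
{
    // Create a hard link so readers can hold the data even after the tmp name
    // is renamed or removed; both names refer to the same underlying inode.
    public static Path makeTmpLink(Path tmpFile) throws IOException
    {
        Path link = tmpFile.resolveSibling(
                tmpFile.getFileName().toString().replace("tmp", "tmplink"));
        Files.createLink(link, tmpFile);
        return link;
    }

    // On finish, atomically rename the tmp file to its final name.
    public static Path finish(Path tmpFile) throws IOException
    {
        Path fin = tmpFile.resolveSibling(
                tmpFile.getFileName().toString().replace("tmp-", ""));
        return Files.move(tmpFile, fin, StandardCopyOption.ATOMIC_MOVE);
    }

    private TmpLinkFlow() {}
}
```

Because the tmplink is a hard link, deleting or renaming the tmp name never invalidates readers that opened the link, which is the property the fix relies on.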
[jira] [Updated] (CASSANDRA-8429) Stress on trunk fails mixed workload on missing keys
[ https://issues.apache.org/jira/browse/CASSANDRA-8429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-8429: --- Reviewer: Benedict Since Version: 2.1.2 could you review [~benedict]? Stress on trunk fails mixed workload on missing keys Key: CASSANDRA-8429 URL: https://issues.apache.org/jira/browse/CASSANDRA-8429 Project: Cassandra Issue Type: Bug Environment: Ubuntu 14.04 Reporter: Ariel Weisberg Assignee: Marcus Eriksson Attachments: cluster.conf, run_stress.sh Starts as part of merge commit 25be46497a8df46f05ffa102bc645bfd684ea48a Stress will say that a key wasn't validated because it isn't returned even though it's loaded. The key will eventually appear and can be queried using cqlsh. Reproduce with #!/bin/sh ROWCOUNT=1000 SCHEMA='-col n=fixed(1) -schema compaction(strategy=LeveledCompactionStrategy) compression=LZ4Compressor' ./cassandra-stress write n=$ROWCOUNT -node xh61 -pop seq=1..$ROWCOUNT no-wrap -rate threads=25 $SCHEMA ./cassandra-stress mixed ratio(read=2) n=1 -node xh61 -pop dist=extreme(1..$ROWCOUNT,0.6) -rate threads=25 $SCHEMA -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8452) Add missing systems to FBUtilities.isUnix, add FBUtilities.isWindows
[ https://issues.apache.org/jira/browse/CASSANDRA-8452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241078#comment-14241078 ] Joshua McKenzie commented on CASSANDRA-8452: Check out v2 on CASSANDRA-6993 as well as Benedict's comment about it. We should probably change the call to isPosixCompliant and compute and store the boolean at static init time and just reference that rather than strcmp every time we want to check it. Add missing systems to FBUtilities.isUnix, add FBUtilities.isWindows Key: CASSANDRA-8452 URL: https://issues.apache.org/jira/browse/CASSANDRA-8452 Project: Cassandra Issue Type: Bug Reporter: Blake Eggleston Assignee: Blake Eggleston Priority: Minor Fix For: 2.1.3 Attachments: CASSANDRA-8452.patch The isUnix method leaves out a few unix systems, which, after the changes in CASSANDRA-8136, causes some unexpected behavior during shutdown. It would also be clearer if FBUtilities had an isWindows method for branching into Windows specific logic. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
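The static-init caching suggested above can be sketched as follows; `OsUtils` and its field names are hypothetical, not the actual FBUtilities API:

```java
// Minimal sketch: compute the OS check once at class-initialization time
// instead of re-reading and comparing the "os.name" string on every call.
public final class OsUtils
{
    // "os.name" is fixed for the lifetime of the JVM, so one lookup suffices.
    private static final String OS = System.getProperty("os.name").toLowerCase();

    public static final boolean IS_WINDOWS = OS.contains("windows");

    // Deliberately broader than a strict equality check, per the ticket's
    // point that several unix systems were being missed.
    public static final boolean IS_UNIX = OS.contains("linux") || OS.contains("mac")
                                       || OS.contains("aix") || OS.contains("sunos")
                                       || OS.contains("bsd");

    private OsUtils() {}
}
```

Callers then branch on the cached boolean (`if (OsUtils.IS_WINDOWS) ...`) with no per-call string comparison.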
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241081#comment-14241081 ] Joshua McKenzie commented on CASSANDRA-8390: [~alexander_radzin]: have you tried disk_access_mode: standard in your test environment to see if it resolves this issue? (see CASSANDRA-6993) The process cannot access the file because it is being used by another process -- Key: CASSANDRA-8390 URL: https://issues.apache.org/jira/browse/CASSANDRA-8390 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin Assignee: Joshua McKenzie Fix For: 2.1.3 21:46:27.810 [NonPeriodicTasks:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[NonPeriodicTasks:1,5,main] org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:135) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:121) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:113) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTableDeletingTask.run(SSTableDeletingTask.java:94) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:664) ~[cassandra-all-2.1.1.jar:2.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
Caused by: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) ~[na:1.7.0_71]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[na:1.7.0_71]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[na:1.7.0_71]
at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) ~[na:1.7.0_71]
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) ~[na:1.7.0_71]
at java.nio.file.Files.delete(Files.java:1079) ~[na:1.7.0_71]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:131) ~[cassandra-all-2.1.1.jar:2.1.1]
... 11 common frames omitted
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
cassandra git commit: Remove tmplink files for offline compactions
Repository: cassandra Updated Branches: refs/heads/cassandra-2.1 d69728f8a - 29259cb22 Remove tmplink files for offline compactions Patch by marcuse; reviewed by jmckenzie for CASSANDRA-8321 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/29259cb2 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/29259cb2 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/29259cb2 Branch: refs/heads/cassandra-2.1 Commit: 29259cb22c2ba02d5c2beba6c6512173f8b5b3f9 Parents: d69728f Author: Marcus Eriksson marc...@apache.org Authored: Tue Nov 25 11:12:20 2014 +0100 Committer: Marcus Eriksson marc...@apache.org Committed: Wed Dec 10 14:46:44 2014 +0100 -- CHANGES.txt | 1 + .../cassandra/io/sstable/SSTableRewriter.java | 31 +-- .../io/sstable/SSTableRewriterTest.java | 91 +++- 3 files changed, 79 insertions(+), 44 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/29259cb2/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 3545afc..2e74a15 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 2.1.3 + * Remove tmplink files for offline compactions (CASSANDRA-8321) * Reduce maxHintsInProgress (CASSANDRA-8415) * BTree updates may call provided update function twice (CASSANDRA-8018) * Release sstable references after anticompaction (CASSANDRA-8386) http://git-wip-us.apache.org/repos/asf/cassandra/blob/29259cb2/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java -- diff --git a/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java b/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java index d187e9d..f9d2fe4 100644 --- a/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java +++ b/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java @@ -190,9 +190,15 @@ public class SSTableRewriter for (PairSSTableWriter, SSTableReader w : finishedWriters) { -// we should close the bloom filter if we have not opened an sstable reader from 
this
-// writer (it will get closed when we release the sstable reference below):
+// we should close the bloom filter if we have not opened an sstable reader from this
+// writer (it will get closed when we release the sstable reference below):
 w.left.abort(w.right == null);
+if (isOffline && w.right != null)
+{
+// the pairs get removed from finishedWriters when they are closedAndOpened in finish(), the ones left need to be removed here:
+w.right.markObsolete();
+w.right.releaseReference();
+}
 }
 // also remove already completed SSTables
@@ -344,7 +350,15 @@ public class SSTableRewriter
 finished.add(newReader);
 if (w.right != null)
+{
 w.right.sharesBfWith(newReader);
+if (isOffline)
+{
+// remove the tmplink files if we are offline - no one is using them
+w.right.markObsolete();
+w.right.releaseReference();
+}
+}
 // w.right is the tmplink-reader we added when switching writer, replace with the real sstable.
 toReplace.add(Pair.create(w.right, newReader));
 }
@@ -356,11 +370,10 @@ public class SSTableRewriter
 it.remove();
 }
-for (Pair<SSTableReader, SSTableReader> replace : toReplace)
-replaceEarlyOpenedFile(replace.left, replace.right);
-
 if (!isOffline)
 {
+for (Pair<SSTableReader, SSTableReader> replace : toReplace)
+replaceEarlyOpenedFile(replace.left, replace.right);
 dataTracker.unmarkCompacting(finished);
 }
 return finished;
@@ -382,8 +395,16 @@ public class SSTableRewriter
 {
 SSTableReader newReader = w.left.closeAndOpenReader(maxAge);
 finished.add(newReader);
+
 if (w.right != null)
+{
 w.right.sharesBfWith(newReader);
+if (isOffline)
+{
+w.right.markObsolete();
+w.right.releaseReference();
+}
+}
 // w.right is the tmplink-reader we added when switching writer, replace with the real sstable.
 toReplace.add(Pair.create(w.right, newReader));
 }
[1/2] cassandra git commit: Remove tmplink files for offline compactions
Repository: cassandra Updated Branches: refs/heads/trunk 2240455f0 - c64ac4188 Remove tmplink files for offline compactions Patch by marcuse; reviewed by jmckenzie for CASSANDRA-8321 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/29259cb2 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/29259cb2 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/29259cb2 Branch: refs/heads/trunk Commit: 29259cb22c2ba02d5c2beba6c6512173f8b5b3f9 Parents: d69728f Author: Marcus Eriksson marc...@apache.org Authored: Tue Nov 25 11:12:20 2014 +0100 Committer: Marcus Eriksson marc...@apache.org Committed: Wed Dec 10 14:46:44 2014 +0100 -- CHANGES.txt | 1 + .../cassandra/io/sstable/SSTableRewriter.java | 31 +-- .../io/sstable/SSTableRewriterTest.java | 91 +++- 3 files changed, 79 insertions(+), 44 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/29259cb2/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 3545afc..2e74a15 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 2.1.3 + * Remove tmplink files for offline compactions (CASSANDRA-8321) * Reduce maxHintsInProgress (CASSANDRA-8415) * BTree updates may call provided update function twice (CASSANDRA-8018) * Release sstable references after anticompaction (CASSANDRA-8386) http://git-wip-us.apache.org/repos/asf/cassandra/blob/29259cb2/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java -- diff --git a/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java b/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java index d187e9d..f9d2fe4 100644 --- a/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java +++ b/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java @@ -190,9 +190,15 @@ public class SSTableRewriter for (PairSSTableWriter, SSTableReader w : finishedWriters) { -// we should close the bloom filter if we have not opened an sstable reader from this -// writer 
(it will get closed when we release the sstable reference below):
+// we should close the bloom filter if we have not opened an sstable reader from this
+// writer (it will get closed when we release the sstable reference below):
 w.left.abort(w.right == null);
+if (isOffline && w.right != null)
+{
+// the pairs get removed from finishedWriters when they are closedAndOpened in finish(), the ones left need to be removed here:
+w.right.markObsolete();
+w.right.releaseReference();
+}
 }
 // also remove already completed SSTables
@@ -344,7 +350,15 @@ public class SSTableRewriter
 finished.add(newReader);
 if (w.right != null)
+{
 w.right.sharesBfWith(newReader);
+if (isOffline)
+{
+// remove the tmplink files if we are offline - no one is using them
+w.right.markObsolete();
+w.right.releaseReference();
+}
+}
 // w.right is the tmplink-reader we added when switching writer, replace with the real sstable.
 toReplace.add(Pair.create(w.right, newReader));
 }
@@ -356,11 +370,10 @@ public class SSTableRewriter
 it.remove();
 }
-for (Pair<SSTableReader, SSTableReader> replace : toReplace)
-replaceEarlyOpenedFile(replace.left, replace.right);
-
 if (!isOffline)
 {
+for (Pair<SSTableReader, SSTableReader> replace : toReplace)
+replaceEarlyOpenedFile(replace.left, replace.right);
 dataTracker.unmarkCompacting(finished);
 }
 return finished;
@@ -382,8 +395,16 @@ public class SSTableRewriter
 {
 SSTableReader newReader = w.left.closeAndOpenReader(maxAge);
 finished.add(newReader);
+
 if (w.right != null)
+{
 w.right.sharesBfWith(newReader);
+if (isOffline)
+{
+w.right.markObsolete();
+w.right.releaseReference();
+}
+}
 // w.right is the tmplink-reader we added when switching writer, replace with the real sstable.
 toReplace.add(Pair.create(w.right, newReader));
 }
[2/2] cassandra git commit: Merge branch 'cassandra-2.1' into trunk
Merge branch 'cassandra-2.1' into trunk Conflicts: test/unit/org/apache/cassandra/io/sstable/SSTableRewriterTest.java Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/c64ac418 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/c64ac418 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/c64ac418 Branch: refs/heads/trunk Commit: c64ac41884f328d0868baee31dbb7a6f685f22f8 Parents: 2240455 29259cb Author: Marcus Eriksson marc...@apache.org Authored: Wed Dec 10 14:51:34 2014 +0100 Committer: Marcus Eriksson marc...@apache.org Committed: Wed Dec 10 14:51:34 2014 +0100 -- CHANGES.txt | 1 + .../cassandra/io/sstable/SSTableRewriter.java | 31 +-- .../io/sstable/SSTableRewriterTest.java | 91 +++- 3 files changed, 79 insertions(+), 44 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/c64ac418/CHANGES.txt -- diff --cc CHANGES.txt index 1e1ec89,2e74a15..0029843 --- a/CHANGES.txt +++ b/CHANGES.txt @@@ -1,44 -1,5 +1,45 @@@ +3.0 + * Fix NPE in SelectStatement with empty IN values (CASSANDRA-8419) + * Refactor SelectStatement, return IN results in natural order instead + of IN value list order (CASSANDRA-7981) + * Support UDTs, tuples, and collections in user-defined + functions (CASSANDRA-7563) + * Fix aggregate fn results on empty selection, result column name, + and cqlsh parsing (CASSANDRA-8229) + * Mark sstables as repaired after full repair (CASSANDRA-7586) + * Extend Descriptor to include a format value and refactor reader/writer apis (CASSANDRA-7443) + * Integrate JMH for microbenchmarks (CASSANDRA-8151) + * Keep sstable levels when bootstrapping (CASSANDRA-7460) + * Add Sigar library and perform basic OS settings check on startup (CASSANDRA-7838) + * Support for aggregation functions (CASSANDRA-4914) + * Remove cassandra-cli (CASSANDRA-7920) + * Accept dollar quoted strings in CQL (CASSANDRA-7769) + * Make assassinate a first class command (CASSANDRA-7935) + * 
Support IN clause on any clustering column (CASSANDRA-4762) + * Improve compaction logging (CASSANDRA-7818) + * Remove YamlFileNetworkTopologySnitch (CASSANDRA-7917) + * Do anticompaction in groups (CASSANDRA-6851) + * Support pure user-defined functions (CASSANDRA-7395, 7526, 7562, 7740, 7781, 7929, + 7924, 7812, 8063, 7813) + * Permit configurable timestamps with cassandra-stress (CASSANDRA-7416) + * Move sstable RandomAccessReader to nio2, which allows using the + FILE_SHARE_DELETE flag on Windows (CASSANDRA-4050) + * Remove CQL2 (CASSANDRA-5918) + * Add Thrift get_multi_slice call (CASSANDRA-6757) + * Optimize fetching multiple cells by name (CASSANDRA-6933) + * Allow compilation in java 8 (CASSANDRA-7028) + * Make incremental repair default (CASSANDRA-7250) + * Enable code coverage thru JaCoCo (CASSANDRA-7226) + * Switch external naming of 'column families' to 'tables' (CASSANDRA-4369) + * Shorten SSTable path (CASSANDRA-6962) + * Use unsafe mutations for most unit tests (CASSANDRA-6969) + * Fix race condition during calculation of pending ranges (CASSANDRA-7390) + * Fail on very large batch sizes (CASSANDRA-8011) + * Improve concurrency of repair (CASSANDRA-6455, 8208) + + 2.1.3 + * Remove tmplink files for offline compactions (CASSANDRA-8321) * Reduce maxHintsInProgress (CASSANDRA-8415) * BTree updates may call provided update function twice (CASSANDRA-8018) * Release sstable references after anticompaction (CASSANDRA-8386) http://git-wip-us.apache.org/repos/asf/cassandra/blob/c64ac418/src/java/org/apache/cassandra/io/sstable/SSTableRewriter.java -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/c64ac418/test/unit/org/apache/cassandra/io/sstable/SSTableRewriterTest.java -- diff --cc test/unit/org/apache/cassandra/io/sstable/SSTableRewriterTest.java index 11030f6,c0a017e..5eae831 --- a/test/unit/org/apache/cassandra/io/sstable/SSTableRewriterTest.java +++ b/test/unit/org/apache/cassandra/io/sstable/SSTableRewriterTest.java @@@ -44,10 -43,8 +44,11 @@@ 
import org.apache.cassandra.db.compacti
 import org.apache.cassandra.db.compaction.ICompactionScanner;
 import org.apache.cassandra.db.compaction.LazilyCompactedRow;
 import org.apache.cassandra.db.compaction.OperationType;
+import org.apache.cassandra.exceptions.ConfigurationException;
+import org.apache.cassandra.io.sstable.format.SSTableReader;
+import org.apache.cassandra.io.sstable.format.SSTableWriter;
+import org.apache.cassandra.locator.SimpleStrategy;
+ import
[jira] [Updated] (CASSANDRA-8417) Default base_time_seconds in DTCS is almost always too large
[ https://issues.apache.org/jira/browse/CASSANDRA-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Björn Hegerfors updated CASSANDRA-8417: --- Attachment: cassandra-trunk-CASSANDRA-8417-basetime60.txt Sorry about the delayed response. I went with 1 minute. Indeed, there's not much of a downside to a too-small baseTime; a too-big baseTime is much more damaging for performance. It's very much analogous to min_sstable_size in STCS, which would also be bad to set too high, while a very low value ought not to hurt performance much compared to the ideal value (whatever that is). It will leave very small SSTables scattered around for a while longer, which means that there might be more SSTables on disk. But these new SSTables are likely to be in disk cache anyway, right? So I'd go with 60 seconds as a better-safe-than-sorry measure. Default base_time_seconds in DTCS is almost always too large Key: CASSANDRA-8417 URL: https://issues.apache.org/jira/browse/CASSANDRA-8417 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Björn Hegerfors Fix For: 2.0.12, 2.1.3 Attachments: cassandra-trunk-CASSANDRA-8417-basetime60.txt One hour is a very long time to compact all new inserts together with any reasonable volume at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241103#comment-14241103 ] Jacques-Henri Berthemet commented on CASSANDRA-8453: What if I write a custom AbstractReplicationStrategy (extending NetworkTopologyStrategy) that would reset TTL info from the writes received from a non-local DC? Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet Here is my scenario: I want to have one datacenter specialized for operations DCO and another for historical/audit DCH. Replication will be used between DCO and DCH. When TTL expires on DCO and data is deleted I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH. I guess this also implies that replication should be done only in DCO = DCH direction so that data is not re-created. But that's secondary, DCH data is not meant to be modified. Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers to modify Cassandra in order to achieve this functionality? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241108#comment-14241108 ] Aleksey Yeschenko commented on CASSANDRA-8453: -- That is not something a replication strategy can do. Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet Here is my scenario: I want to have one datacenter specialized for operations DCO and another for historical/audit DCH. Replication will be used between DCO and DCH. When TTL expires on DCO and data is deleted I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH. I guess this also implies that replication should be done only in DCO = DCH direction so that data is not re-created. But that's secondary, DCH data is not meant to be modified. Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers to modify Cassandra in order to achieve this functionality? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8356) Slice query on a super column family with counters doesn't get all the data
[ https://issues.apache.org/jira/browse/CASSANDRA-8356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241116#comment-14241116 ] Philo Yang commented on CASSANDRA-8356: --- Hi, in fact I have the same trouble with missing data on slice queries. My cluster is 2.1.1, on a regular table (created via cql3, no counters). My table is like this:
{noformat}
CREATE TABLE word (
    user text,
    word text,
    alter_time bigint,
    (some other columns...)
    PRIMARY KEY (user, word)
) WITH CLUSTERING ORDER BY (word ASC)
{noformat}
Consider a row with 26 columns whose column names are a, b, ..., z. Usually I query this table like this:
{noformat}
select * from word where user = 'userid';
{noformat}
and it should return 26 rows in cql. However, some rows (most rows in this table won't lose data) return only part of their columns; some rows (rows for cql, columns for cassandra) will not be returned even at consistency level ALL. Which rows are missing is fixed across repeated queries. For example, suppose 'b' is a row that is missing. If I query like
{noformat}
select * from word where user = 'userid';
or select * from word where user = 'userid' and word > 'a';
or select * from word where user = 'userid' and word >= 'b';
or select * from word where user = 'userid' and word >= 'b' order by word desc;
or select * from word where user = 'userid' and word < 'z';
{noformat}
'b' is always missing. But if I query like:
{noformat}
select * from word where user = 'userid' and word = 'b';
or select * from word where user = 'userid' and word <= 'b';
or select * from word where user = 'userid' and word <= 'b' order by word desc;
{noformat}
it will show in the result set.
Slice query on a super column family with counters doesn't get all the data
---
Key: CASSANDRA-8356 URL: https://issues.apache.org/jira/browse/CASSANDRA-8356 Project: Cassandra Issue Type: Bug Reporter: Nicolas Lalevée Assignee: Aleksey Yeschenko Fix For: 2.0.12 We've finally been able to upgrade our cluster to 2.0.11, after CASSANDRA-7188 was fixed. But now slice queries on a super column family with counters don't return all the expected data. Given all the trouble we had, we first thought we had lost data, but there is a way to actually get the data, so nothing is lost; it's just that cassandra seems to incorrectly skip it. See the following CQL log:
{noformat}
cqlsh:Theme> desc table theme_view;

CREATE TABLE theme_view (
  key bigint,
  column1 varint,
  column2 text,
  value counter,
  PRIMARY KEY ((key), column1, column2)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.01 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.00 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.00 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

cqlsh:Theme> select * from theme_view where key = 99421 limit 10;

 key   | column1 | column2    | value
-------+---------+------------+-------
 99421 |     -12 | 2011-03-25 |    59
 99421 |     -12 | 2011-03-26 |     5
 99421 |     -12 | 2011-03-27 |     2
 99421 |     -12 | 2011-03-28 |    40
 99421 |     -12 | 2011-03-29 |    14
 99421 |     -12 | 2011-03-30 |    17
 99421 |     -12 | 2011-03-31 |     5
 99421 |     -12 | 2011-04-01 |    37
 99421 |     -12 | 2011-04-02 |     7
 99421 |     -12 | 2011-04-03 |     4

(10 rows)

cqlsh:Theme> select * from theme_view where key = 99421 and column1 = -12 limit 10;

 key   | column1 | column2    | value
-------+---------+------------+-------
 99421 |     -12 | 2011-03-25 |    59
 99421 |     -12 | 2014-05-06 |    15
 99421 |     -12 | 2014-06-06 |     7
 99421 |     -12 | 2014-06-10 |    22
 99421 |     -12 | 2014-06-11 |    34
 99421 |     -12 | 2014-06-12 |    35
 99421 |     -12 | 2014-06-13 |    26
 99421 |     -12 | 2014-06-14 |    16
 99421 |     -12 | 2014-06-15 |    24
 99421 |     -12 | 2014-06-16 |    25

(10 rows)
{noformat}
As you can see, the second query should return data from 2012, but it does not. Via thrift, we have the exact same bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241118#comment-14241118 ] Jacques-Henri Berthemet commented on CASSANDRA-8453: Do you know which class receives/sends the replication messages from other DCs? Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet Here is my scenario: I want to have one datacenter specialized for operations DCO and another for historical/audit DCH. Replication will be used between DCO and DCH. When TTL expires on DCO and data is deleted I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH. I guess this also implies that replication should be done only in DCO = DCH direction so that data is not re-created. But that's secondary, DCH data is not meant to be modified. Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers to modify Cassandra in order to achieve this functionality? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241147#comment-14241147 ] Alan Boudreault commented on CASSANDRA-8316: [~krummas] [~yukim] With some test runs, I cannot see the high CPU utilization issue again. However, I still see the error message. Also, I've noticed an important change between running with and without the patch. WITHOUT the patch: I can re-run the incremental repairs. I might again get the error message on the node that initially failed, but things will be OK after the endpoints that initially failed are repaired. WITH the patch: I cannot do incremental repairs anymore, even after a restart. This is what I get trying to run the repairs on my node:
{code}
aboudreault@kovarro:~/dev/cstar/8316$ ccm node1 nodetool -- repair -par -inc
[2014-12-10 09:00:42,767] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:48,045] Repair session ee2a78c0-8074-11e4-9b59-bbfe19a8e904 for range (4611686018427387904,6917529027641081856] finished
[2014-12-10 09:00:48,046] Repair session ef77e050-8074-11e4-9b59-bbfe19a8e904 for range (2305843009213693952,4611686018427387904] finished
[2014-12-10 09:00:48,048] Repair session f06107d0-8074-11e4-9b59-bbfe19a8e904 for range (6917529027641081856,-9223372036854775808] finished
[2014-12-10 09:00:48,078] Repair command #1 finished
[2014-12-10 09:00:48,088] Nothing to repair for keyspace 'system'
[2014-12-10 09:00:48,104] Starting repair command #2, repairing 2 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:00:58,916] Repair failed with error Did not get positive replies from all endpoints.
List of failed endpoint(s): [127.0.0.2]
aboudreault@kovarro:~/dev/cstar/8316$ ccm node2 nodetool -- repair -par -inc
[2014-12-10 09:01:07,233] Starting repair command #1, repairing 3 ranges for keyspace r1 (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,239] Repair failed with error Already repairing SSTableReader(path='/home/aboudreault/.ccm/local/node2/data/r1/Standard1-c38dd6f0807111e494d8bbfe19a8e904/r1-Standard1-ka-5-Data.db'), can not continue.
[2014-12-10 09:01:07,247] Nothing to repair for keyspace 'system'
[2014-12-10 09:01:07,252] Starting repair command #2, repairing 2 ranges for keyspace system_traces (parallelism=PARALLEL, full=false)
[2014-12-10 09:01:07,254] Repair failed with error null
{code}
Does this help?

Did not get positive replies from all endpoints error on incremental repair
--
Key: CASSANDRA-8316 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra 2.1.2 Reporter: Loic Lambiel Assignee: Marcus Eriksson Fix For: 2.1.3 Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh Hi, I've got an issue with incremental repairs on our production 15-node 2.1.2 cluster (new cluster, not yet loaded, RF=3). After having successfully performed an incremental repair (-par -inc) on 3 nodes, I started receiving "Repair failed with error Did not get positive replies from all endpoints." from nodetool on all remaining nodes:
[2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace (seq=false, full=false)
[2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
All the nodes are up and running, and the local system log shows that the repair commands got started, and that's it. I've also noticed that soon after the repair, several nodes started having more cpu load indefinitely without any particular reason (no tasks / queries, nothing in the logs).
I then restarted C* on these nodes and retried the repair on several nodes, which were successful until facing the issue again. I tried to repro on our 3-node preproduction cluster without success. It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html Any idea? Thanks Loic -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241151#comment-14241151 ] Ariel Weisberg commented on CASSANDRA-8449: --- It's not just junk data right? If the file is unmapped it would probably segfault. I think you need to reference count the file. Allow zero-copy reads again --- Key: CASSANDRA-8449 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 3.0 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads accessing a ByteBuffer when the data was unmapped by compaction. Currently this code path is only used for uncompressed reads. The actual bytes are in fact copied to the client output buffers for both netty and thrift before being sent over the wire, so the only issue really is the time it takes to process the read internally. This patch adds a slow network read test and changes the tidy() method to actually delete a sstable once the readTimeout has elapsed giving plenty of time to serialize the read. Removing this copy causes significantly less GC on the read path and improves the tail latencies: http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
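Ariel's reference-counting suggestion can be sketched as follows. This is a hypothetical illustration, not Cassandra's actual API (the class and method names are invented): each in-flight read must acquire a reference before touching the mapped region, and the unmap callback only runs once the last holder releases.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-file reference counting (names are invented,
// not Cassandra's API). The owner (e.g. compaction) holds the initial
// reference; readers acquire/release around each access to the mapping.
class MappedRegionRef {
    private final AtomicInteger refs = new AtomicInteger(1); // 1 = the owner
    private final Runnable unmap; // the real munmap/buffer-clean call

    MappedRegionRef(Runnable unmap) { this.unmap = unmap; }

    /** A reader must acquire before reading; false means the region is gone. */
    boolean tryAcquire() {
        int r;
        do {
            r = refs.get();
            if (r == 0) return false; // too late: already unmapped
        } while (!refs.compareAndSet(r, r + 1));
        return true;
    }

    /** Both readers and the owner release; the last one out unmaps. */
    void release() {
        if (refs.decrementAndGet() == 0)
            unmap.run(); // never runs while any read is in flight
    }
}
```

A reader that fails `tryAcquire()` would fall back to a non-mapped read path; the point is simply that `unmap.run()` can never race with an in-flight read, which is the segfault scenario described above.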
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153 ] Marcus Eriksson commented on CASSANDRA-8316: yep, will have a look, seems that we don't clear out the repair session on this failure mode Did not get positive replies from all endpoints error on incremental repair -- Key: CASSANDRA-8316 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra 2.1.2 Reporter: Loic Lambiel Assignee: Marcus Eriksson Fix For: 2.1.3 Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241153#comment-14241153 ] Marcus Eriksson edited comment on CASSANDRA-8316 at 12/10/14 2:42 PM: -- yep, will have a look, seems that we don't clear out the parent repair session on this failure mode was (Author: krummas): yep, will have a look, seems that we don't clear out the repair session on this failure mode Did not get positive replies from all endpoints error on incremental repair -- Key: CASSANDRA-8316 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra 2.1.2 Reporter: Loic Lambiel Assignee: Marcus Eriksson Fix For: 2.1.3 Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241173#comment-14241173 ] T Jake Luciani commented on CASSANDRA-8449: --- Hmm yeah, if there was a pause before serializing the message and compaction was running you would still segfault. I was initially going to try using a reference queue and weak references to track when the byte buffer becomes unreachable, but any .duplicate() would break that model. [~benedict] perhaps we can use CASSANDRA-7705 to manage this? Allow zero-copy reads again --- Key: CASSANDRA-8449 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241215#comment-14241215 ] Benedict commented on CASSANDRA-8449: - CASSANDRA-7705 is really designed for situations where we know there won't be loads in-flight; I'd prefer not to reintroduce excessive long-lifetime reference counting onto the read critical path (we don't ref count sstable readers anymore, since CASSANDRA-6919). All we're doing here is delaying when we unmap the file until a time it is known to be unused, so we could create a global OpOrder that guards against this; all requests that hit the node are guarded by the OpOrder for their entire duration, and only once _all_ requests that started prior to our _thinking_ the data is free have completed do we actually free it. Typically I would not want to use this approach for guarding operations that could take arbitrarily long, but really all we're sacrificing is virtual address space, so being delayed more than you expect (even excessively) should not noticeably impact system performance, as the OS can choose to drop those pages on the floor, keeping only the mapping overhead. Allow zero-copy reads again --- Key: CASSANDRA-8449 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
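Benedict's OpOrder scheme can be illustrated with a toy sketch. It is deliberately far simpler than Cassandra's real org.apache.cassandra.utils.concurrent.OpOrder (the names and the busy-wait are invented for illustration): every request registers with the current group, and freeing the mapping happens behind a barrier that first opens a new group and then waits for the old one to drain.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy sketch of the OpOrder idea (not Cassandra's actual implementation):
// reads run inside a "group"; a barrier swaps in a new group, waits for the
// old one to drain, and only then runs the free/unmap action.
class ToyOpOrder {
    static final class Group {
        final AtomicInteger running = new AtomicInteger(0);
    }
    private volatile Group current = new Group();

    /** A request calls start() before touching mapped data and finish() after. */
    Group start() { Group g = current; g.running.incrementAndGet(); return g; }
    void finish(Group g) { g.running.decrementAndGet(); }

    /** Free a resource only after every op started before this call is done. */
    void barrier(Runnable free) {
        Group old = current;
        current = new Group();        // later ops land in the new group
        while (old.running.get() > 0) // real code parks/waits; a sketch spins
            Thread.onSpinWait();
        free.run();                   // no pre-barrier read can still see the mapping
    }
}
```

The trade-off Benedict describes falls out of the sketch: a slow reader only delays `free.run()`, and what stays reserved in the meantime is virtual address space rather than physical memory.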
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241216#comment-14241216 ] Robert Stupp commented on CASSANDRA-8453: - I don't know the exact class name. But I'd strongly recommend not to change that behavior in the code. It can and will damage data in the whole cluster, since all partitions must be the same on all nodes - that's (in simple words) the core principle Aleksey mentioned. Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet Here is my scenario: I want to have one datacenter specialized for operations DCO and another for historical/audit DCH. Replication will be used between DCO and DCH. When TTL expires on DCO and data is deleted, I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH. I guess this also implies that replication should be done only in the DCO -> DCH direction so that data is not re-created. But that's secondary, DCH data is not meant to be modified. Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers to modify Cassandra in order to achieve this functionality? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241255#comment-14241255 ] Alexander Radzin commented on CASSANDRA-8390: - Joshua McKenzie, I tried this with the same result. I validated that the value was indeed used by cassandra by changing the value to something illegal. This threw an exception. The process cannot access the file because it is being used by another process -- Key: CASSANDRA-8390 URL: https://issues.apache.org/jira/browse/CASSANDRA-8390 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin Assignee: Joshua McKenzie Fix For: 2.1.3 21:46:27.810 [NonPeriodicTasks:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[NonPeriodicTasks:1,5,main] org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:135) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:121) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:113) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTableDeletingTask.run(SSTableDeletingTask.java:94) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:664) ~[cassandra-all-2.1.1.jar:2.1.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71] at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_71] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] Caused by: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) ~[na:1.7.0_71] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[na:1.7.0_71] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[na:1.7.0_71] at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) ~[na:1.7.0_71] at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) ~[na:1.7.0_71] at java.nio.file.Files.delete(Files.java:1079) ~[na:1.7.0_71] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:131) ~[cassandra-all-2.1.1.jar:2.1.1] ... 11 common frames omitted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko resolved CASSANDRA-8453. -- Resolution: Not a Problem Right. If you somehow manage to do it your way, on reads the digest would mismatch (ttl is part of the digest calculation) and read repair will sync the data. Same regarding explicit repair. Either you go with writing the same data to several separate keyspaces - with TTL to one, without TTL to another, and only have the non-ttl keyspace in one DC, or, well, there is no other way. Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241255#comment-14241255 ] Alexander Radzin edited comment on CASSANDRA-8390 at 12/10/14 4:06 PM: --- Joshua McKenzie, I tried this with the same result. I validated that the parameter was indeed used by cassandra by changing the value to something illegal. This threw an exception. was (Author: alexander_radzin): Joshua McKenzie, I tried this with the same result. I validated that the value was indeed used by cassandra by changing value to something illegal. This threw exception. The process cannot access the file because it is being used by another process -- Key: CASSANDRA-8390 URL: https://issues.apache.org/jira/browse/CASSANDRA-8390 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin Assignee: Joshua McKenzie Fix For: 2.1.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8371) DateTieredCompactionStrategy is always compacting
[ https://issues.apache.org/jira/browse/CASSANDRA-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241301#comment-14241301 ] Björn Hegerfors commented on CASSANDRA-8371: [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, do not actually move with it. That's where his description is wrong (I just told him about it). Rather, these window borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be, there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is no moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, that spans the whole time window. In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. 
It's true that repairs can put data in old windows at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is your definition of perfect scheduling? DateTieredCompactionStrategy is always compacting -- Key: CASSANDRA-8371 URL: https://issues.apache.org/jira/browse/CASSANDRA-8371 Project: Cassandra Issue Type: Bug Components: Core Reporter: mck Assignee: Björn Hegerfors Labels: compaction, performance Attachments: java_gc_counts_rate-month.png, read-latency-recommenders-adview.png, read-latency.png, sstables-recommenders-adviews.png, sstables.png, vg2_iad-month.png Running 2.0.11 and having switched a table to [DTCS|https://issues.apache.org/jira/browse/CASSANDRA-6602] we've seen that disk IO and gc count increase, along with the number of reads happening in the compaction hump of cfhistograms. Data, and generally performance, looks good, but compactions are always happening, and pending compactions are building up. The schema for this is {code}CREATE TABLE search ( loginid text, searchid timeuuid, description text, searchkey text, searchurl text, PRIMARY KEY ((loginid), searchid) );{code} We're sitting on about 82G (per replica) across 6 nodes in 4 DCs. CQL executed against this keyspace, and traffic patterns, can be seen in slides 7+8 of https://prezi.com/b9-aj6p2esft/ Attached are sstables-per-read and read-latency graphs from cfhistograms, and screenshots of our munin graphs as we have gone from STCS, to LCS (week ~44), to DTCS (week ~46). These screenshots are also found in the prezi on slides 9-11. [~pmcfadin], [~Bj0rn], Can this be a consequence of occasional deleted rows, as is described under (3) in the description of CASSANDRA-6602 ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
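The static-window scheme Björn describes can be made concrete with a small sketch. It is modeled on the shape of the DTCS window logic but simplified, with invented names, so treat it as an illustration rather than the actual DateTieredCompactionStrategy code: walking back from the newest base-sized window, an index divisible by min_threshold marks the moment a full run of same-sized windows has fused into one window min_threshold times larger.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of static DTCS-style windows (invented names, simplified):
// returns the window sizes in seconds, newest first, covering [0, now].
class DtcsWindowSketch {
    static List<Long> windowSizes(long now, long baseSeconds, int minThreshold) {
        List<Long> sizes = new ArrayList<>();
        long size = baseSeconds;
        long divPosition = now / baseSeconds; // index of the newest window
        while (divPosition >= 0) {
            sizes.add(size);
            if (divPosition % minThreshold == 0) {
                // a run of minThreshold same-sized windows has just fused,
                // so the next (older) window is minThreshold times larger
                size *= minThreshold;
                divPosition = divPosition / minThreshold - 1;
            } else {
                divPosition -= 1;
            }
        }
        return sizes;
    }
}
```

With base_time_seconds=60 and min_threshold=4, at now=190 this yields four 1-minute windows; at now=250 it yields one fresh 1-minute window plus one fused 4-minute window, matching the minute-by-minute example in the comment.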
[jira] [Comment Edited] (CASSANDRA-8371) DateTieredCompactionStrategy is always compacting
[ https://issues.apache.org/jira/browse/CASSANDRA-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241301#comment-14241301 ] Björn Hegerfors edited comment on CASSANDRA-8371 at 12/10/14 4:13 PM: -- [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, should not actually move with it. That's where his description is wrong (I just told him about it). Rather, these windows borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be, there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is not moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, that spans the whole time window. In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. 
It's true that repairs can data in old windows in at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is you definition of perfect scheduling? was (Author: bj0rn): [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, do not actually move with it. That's where his description is wrong (I just told him about it). Rather, these windows borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be, there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is not moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, that spant the while time window. 
In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. It's true that repairs can data in old windows in at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is you definition of perfect scheduling? DateTieredCompactionStrategy is always compacting -- Key: CASSANDRA-8371 URL: https://issues.apache.org/jira/browse/CASSANDRA-8371 Project: Cassandra Issue Type: Bug
[jira] [Comment Edited] (CASSANDRA-8371) DateTieredCompactionStrategy is always compacting
[ https://issues.apache.org/jira/browse/CASSANDRA-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241301#comment-14241301 ] Björn Hegerfors edited comment on CASSANDRA-8371 at 12/10/14 4:14 PM: -- [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, should not actually move with it. That's where his description is wrong (I just told him about it). Rather, these windows borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be, there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is not moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, that spans the whole time window. In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. 
It's true that repairs can put data in old windows at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is your definition of perfect scheduling? was (Author: bj0rn): [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, should not actually move with it. That's where his description is wrong (I just told him about it). Rather, these windows borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be, there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is not moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, that spans the whole time window. 
In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. It's true that repairs can data in old windows in at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is you definition of perfect scheduling? DateTieredCompactionStrategy is always compacting -- Key: CASSANDRA-8371 URL: https://issues.apache.org/jira/browse/CASSANDRA-8371 Project: Cassandra Issue
[jira] [Comment Edited] (CASSANDRA-8371) DateTieredCompactionStrategy is always compacting
[ https://issues.apache.org/jira/browse/CASSANDRA-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241301#comment-14241301 ] Björn Hegerfors edited comment on CASSANDRA-8371 at 12/10/14 4:14 PM: -- [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, should not actually move with it. That's where his description is wrong (I just told him about it). Rather, these window borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be; there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is no moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, which spans the whole time window. In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. 
It's true that repairs can put data in old windows at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is your definition of ideal scheduling? was (Author: bj0rn): [~jshook] I don't understand what you're saying about ideal scheduling. There might be some confusion here, as Marcus's blog post about DTCS draws a simplified picture of how DTCS works. In his picture, the rightmost vertical line represents now. And while now certainly moves forward, the other vertical lines, denoting window borders, should not actually move with it. That's where his description is wrong (I just told him about it). Rather, these window borders are perfectly static, and the passage of time instead unveils new time windows. The newest window (which now actually lies _inside_ of) is always base_time_seconds in size. Then windows are merged with each other at certain points in time. This is an instantaneous thing. Specifically, min_threshold windows of the same size are merged into one window at exactly the moment when yet another window of that same size is created. Say that min_threshold=4 and base_time_seconds=60 (1 minute). Let's say that the last 4 windows are all 1-minute windows (they certainly don't have to be; there can be anywhere between 1 and 4 same-sized windows). At the turn of the next minute, a new 1-minute window is created, and the previous ones are from that moment considered to be one 4-minute window (there is no moment when there are 5 1-minute windows). The windows are the ideal SSTable placements for DTCS. The idea is that every window only contains one SSTable, which spans the whole time window. 
In practice, this is also very nearly what happens, except that the compaction triggered by windows merging is not instantaneous. There are some quirks that let more than one SSTable live in one time window. CASSANDRA-8360 wants to address that. CASSANDRA-8361 takes it one step further. It's true that repairs can put data in old windows at later points. Read repairs don't mix too well with DTCS for that reason, but anti-entropy repair costs so much that an extra compaction at the end makes little difference. I think incremental repair should mix nicely with DTCS, but I don't know much about it. Sorry if you already knew all of this, but in that case, what is your definition of perfect scheduling? DateTieredCompactionStrategy is always compacting -- Key: CASSANDRA-8371 URL: https://issues.apache.org/jira/browse/CASSANDRA-8371 Project: Cassandra Issue
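The merge rule Björn describes (static window borders; min_threshold same-sized windows become one larger window the instant a new window of that size appears) can be modelled in a few lines. This is an illustrative Python sketch loosely mirroring the Target iteration in DTCS; the function name and parameters are mine, not the actual implementation:

```python
def dtcs_windows(now, base=60, min_threshold=4, count=6):
    """List the ideal DTCS windows from newest to oldest as (size, start)
    pairs. Window borders are static; once min_threshold windows of a
    size would exist, they count as one window min_threshold times larger."""
    size, div = base, now // base          # newest window: [div*base, (div+1)*base)
    out = []
    for _ in range(count):
        out.append((size, div * size))
        if div % min_threshold == 0:       # merge point: jump to the next tier
            size, div = size * min_threshold, div // min_threshold - 1
        else:
            div -= 1
    return out
```

With now=430, base=60, min_threshold=4 this yields four 1-minute windows followed by a 4-minute window, matching the "anywhere between 1 and 4 same-sized windows" behaviour described above.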
[jira] [Updated] (CASSANDRA-8308) Windows: Commitlog access violations on unit tests
[ https://issues.apache.org/jira/browse/CASSANDRA-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie updated CASSANDRA-8308: --- Attachment: 8308_v2.txt v2 attached. Good catch on truncate on nio - I misread the javadoc on that and also assumed they were going for functional parity with RAF.setLength. I couldn't find an analogue to RAF.setLength in nio; rather than creating a single-byte ByteBuffer, seeking to DD.getCommitLogSegmentSize(), writing that byte, and seeking back - I went ahead and just used RAF.setLength to get our size and then used the FileChannel API to map it later, since it seems less error-prone and opening a CLS isn't on the critical path. If there's a cleaner or more idiomatic way to do that in nio I'm all for it, but I couldn't track it down. I also added another call to CommitLog.instance.resetUnsafe in SchemaLoader before we attempt to delete directories, as it was failing to delete the memory-mapped files. Not sure why it worked in v1 but it definitely needs it now. Lastly - while I 100% agree the OS determination needs to be tightened up (see CASSANDRA-8452), I'm not sure how that's related to this patch as none of the changes reference that. 
Windows: Commitlog access violations on unit tests -- Key: CASSANDRA-8308 URL: https://issues.apache.org/jira/browse/CASSANDRA-8308 Project: Cassandra Issue Type: Bug Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0 Attachments: 8308_v1.txt, 8308_v2.txt We have four unit tests failing on trunk on Windows, all with FileSystemExceptions related to the SchemaLoader: {noformat} [junit] Test org.apache.cassandra.db.compaction.DateTieredCompactionStrategyTest FAILED [junit] Test org.apache.cassandra.cql3.ThriftCompatibilityTest FAILED [junit] Test org.apache.cassandra.io.sstable.SSTableRewriterTest FAILED [junit] Test org.apache.cassandra.repair.LocalSyncTaskTest FAILED {noformat} Example error: {noformat} [junit] Caused by: java.nio.file.FileSystemException: build\test\cassandra\commitlog;0\CommitLog-5-1415908745965.log: The process cannot access the file because it is being used by another process. [junit] [junit] at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) [junit] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) [junit] at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) [junit] at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) [junit] at java.nio.file.Files.delete(Files.java:1079) [junit] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:125) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Branimir Lambov updated CASSANDRA-7032: --- Attachment: TestVNodeAllocation.java Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241332#comment-14241332 ] Branimir Lambov commented on CASSANDRA-7032: Ignoring replication for the time being (more on that below), and looking at the best thing we can do when we have an existing setup and are trying to add a node, I came up with the following approach. We can only assign new vnodes, which means that we can only _take away_ load from other nodes, never add to it. On the one hand this means that underutilized nodes are hopeless until the cluster grows enough for their share to become normal. On the other it means that the best thing to do (aiming for the smallest overutilization, i.e. max deviation from mean) is to take the highest-load nodes and spread their load evenly between them and the new node. Adding a new node gives us _vn_ (the number of vnodes) new tokens to issue, i.e. we can decrease the load in at most _vn_ other nodes. We can pick the _vn_ highest-load ones, but some of them may already have a lower load than the target spread; we thus select the largest _n <= vn_ highest-load nodes such that the spread load _t_, which is their combined load divided by _n+1_, is lower than the load of each individual node. We can then choose how to assign _vn_ tokens splitting some of the ranges in these _n_ nodes to reduce the load of each node to _t_. This should also leave the new node with a load of _t_. 
The attached code implements a simple version of this which improves overutilization very quickly with every new node-- a typical simulation looks like: {code} Random generation of 1000 nodes with 256 tokens each Size 1000 max 1.24 min 0.80 No replication Adding 1 node(s) using NoReplicationTokenDistributor Size 1001 max 1.11 min 0.80 No replication Adding 9 node(s) using NoReplicationTokenDistributor Size 1010 max 1.05 min 0.81 No replication Adding 30 node(s) using NoReplicationTokenDistributor Size 1040 max 1.02 min 0.83 No replication Adding 210 node(s) using NoReplicationTokenDistributor Size 1250 max 1.00 min 1.00 No replication {code} It also constructs clusters from empty pretty well. However, when replication is present the load distribution of this allocation does not look good (the added node tends to take much more than it should; one reason for this is that it becomes a replica of the token ranges it splits), which is not unexpected. I am now trying to see how exactly taking replication into account affects the reasoning above. We can still only remove load, but the way splitting affects the loads is not that clear any more. As far as I can see the following simplification of Cassandra's replication strategies should suffice for handling the current and planned variations: * we have units made up of a number of vnodes whose load we want to be able to balance (currently unit==node, but in the future the unit could be smaller (a disk or core)) * units are bunched up in racks (if racks are not defined, a node is implicitly a rack for its units) * replicas of data must be placed on the closest higher vnodes that belong to different racks * the replication strategy specifies the number of replicas and the set of units belonging to each rack Datacentres are irrelevant as replication is specified within each dc, i.e. we can isolate the vnode allocation to the individual dc. 
If disk/core-level allocation is in place, the node boundaries within a rack can be ignored as well. Is there anything I'm missing? [~benedict]: I believe you prefer to split the disk/core workload inside the node by assigning a token range (e.g. the vnodes that intersect with a range corresponding to _1/n_ of the token ring are to be handled by that disk/core). I prefer to just choose _1/n_ of the vnodes, because it lets me directly balance them-- do you have any objections to this? Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The
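The selection step Branimir describes (take the largest feasible number n of highest-load nodes whose spread load t = combined load / (n+1) is below each individual load) can be sketched as follows. All names here are mine for illustration; this is not taken from the attached TestVNodeAllocation.java:

```python
def choose_nodes_to_unload(loads, vn):
    """Pick which overloaded nodes the new node should take tokens from.

    loads: dict of node -> current load; vn: number of tokens the new node
    brings (at most vn other nodes can be relieved). Returns
    (nodes_to_split, target_load_t): reducing each chosen node to t and
    giving the new node load t spreads the excess load evenly."""
    ranked = sorted(loads.items(), key=lambda kv: kv[1], reverse=True)
    for n in range(min(vn, len(ranked)), 0, -1):      # try the largest n first
        chosen = ranked[:n]
        t = sum(load for _, load in chosen) / (n + 1)
        if all(load > t for _, load in chosen):       # t must undercut every choice
            return [node for node, _ in chosen], t
    return [], None
```

For example, with loads {a: 1.2, b: 1.1, c: 0.9, d: 0.8} and vn=3, nodes a, b, and c are split and every participant (including the new node) ends at load t = 3.2/4 = 0.8.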
[jira] [Commented] (CASSANDRA-8453) Ability to override TTL on different data-centers, plus one-way replication
[ https://issues.apache.org/jira/browse/CASSANDRA-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241336#comment-14241336 ] Jacques-Henri Berthemet commented on CASSANDRA-8453: OK I understand. Thank you both for those detailed explanations. Regards, JH Ability to override TTL on different data-centers, plus one-way replication --- Key: CASSANDRA-8453 URL: https://issues.apache.org/jira/browse/CASSANDRA-8453 Project: Cassandra Issue Type: Wish Components: Core Reporter: Jacques-Henri Berthemet Here is my scenario: I want to have one datacenter specialized for operations (DCO) and another for historical/audit (DCH). Replication will be used between DCO and DCH. When TTL expires on DCO and data is deleted, I'd like the data on DCH to be kept for other purposes. Ideally a different TTL could be set in DCH. I guess this also implies that replication should be done only in the DCO -> DCH direction so that data is not re-created. But that's secondary, DCH data is not meant to be modified. Is this kind of feature feasible for future versions of Cassandra? If not, would you have some pointers to modify Cassandra in order to achieve this functionality? Thank you. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-6993) Windows: remove mmap'ed I/O for index files and force standard file access
[ https://issues.apache.org/jira/browse/CASSANDRA-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241342#comment-14241342 ] Joshua McKenzie commented on CASSANDRA-6993: Very good point. Looking through the code-base, every place where we're using isUnix seems to really mean 'isn't Windows' so I'd be comfortable with that distinction for now with the ability to make it more complex/powerful in the future if necessary. Also, right now we're skipping early re-open on files based on that check for FBUtilities.isUnix (see CASSANDRA-7365) so I'd prefer to get this modification in before 2.1.3 so we can get more coverage / usage of the early re-open logic, at least on OSX-based dev machines. Windows: remove mmap'ed I/O for index files and force standard file access -- Key: CASSANDRA-6993 URL: https://issues.apache.org/jira/browse/CASSANDRA-6993 Project: Cassandra Issue Type: Improvement Reporter: Joshua McKenzie Assignee: Joshua McKenzie Priority: Minor Labels: Windows Fix For: 3.0, 2.1.3 Attachments: 6993_2.1_v1.txt, 6993_v1.txt, 6993_v2.txt Memory-mapped I/O on Windows causes issues with hard-links; we're unable to delete hard-links to open files with memory-mapped segments even using nio. We'll need to push for close to performance parity between mmap'ed I/O and buffered going forward as the buffered / compressed path offers other benefits. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7032) Improve vnode allocation
[ https://issues.apache.org/jira/browse/CASSANDRA-7032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241349#comment-14241349 ] Benedict commented on CASSANDRA-7032: - If you mean for V vnode tokens in ascending order [0..V), and e.g. D disks, the disks would own one of the token lists in the set { [dV/D..(d+1)V/D) : 0 <= d < D }, and you guarantee that the owned range of each list is balanced with the other lists, this seems pretty analogous to the approach I was describing and perfectly reasonable. The main goal is only that once a range or set of vnode tokens has been assigned to a given resource (disk, cpu, node, rack, whatever), that resource never needs to reassign its tokens. Improve vnode allocation Key: CASSANDRA-7032 URL: https://issues.apache.org/jira/browse/CASSANDRA-7032 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Branimir Lambov Labels: performance, vnodes Fix For: 3.0 Attachments: TestVNodeAllocation.java, TestVNodeAllocation.java, TestVNodeAllocation.java It's been known for a little while that random vnode allocation causes hotspots of ownership. It should be possible to improve dramatically on this with deterministic allocation. I have quickly thrown together a simple greedy algorithm that allocates vnodes efficiently, and will repair hotspots in a randomly allocated cluster gradually as more nodes are added, and also ensures that token ranges are fairly evenly spread between nodes (somewhat tunably so). The allocation still permits slight discrepancies in ownership, but it is bound by the inverse of the size of the cluster (as opposed to random allocation, which strangely gets worse as the cluster size increases). I'm sure there is a decent dynamic programming solution to this that would be even better. 
If on joining the ring a new node were to CAS a shared table where a canonical allocation of token ranges lives after running this (or a similar) algorithm, we could then get guaranteed bounds on the ownership distribution in a cluster. This will also help for CASSANDRA-6696. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
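Benedict's set notation above, with the d-th of D disks owning the vnode tokens indexed [dV/D, (d+1)V/D) out of V sorted tokens, amounts to a simple contiguous slicing. An illustrative sketch (names are mine):

```python
def disk_token_lists(sorted_tokens, num_disks):
    """Split V sorted vnode tokens into num_disks contiguous lists:
    disk d owns the tokens at indices [d*V/num_disks, (d+1)*V/num_disks)."""
    V = len(sorted_tokens)
    return [sorted_tokens[d * V // num_disks:(d + 1) * V // num_disks]
            for d in range(num_disks)]
```

Because each resource keeps the slice it was assigned as the cluster evolves, tokens never need reassignment - the property the comment names as the main goal.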
[jira] [Commented] (CASSANDRA-6060) Remove internal use of Strings for ks/cf names
[ https://issues.apache.org/jira/browse/CASSANDRA-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241364#comment-14241364 ] Ariel Weisberg commented on CASSANDRA-6060: --- I am still digging but I am not sure there is much value here. For prepared statements between client and server there are no ks/cf names. Here is the breakdown for a minimum-size mutation inside the cluster: Size of Ethernet frame - 24 Bytes Size of IPv4 Header (without any options) - 20 bytes Size of TCP Header (without any options) - 20 Bytes 4-bytes protocol magic 4-bytes version 4-bytes timestamp 4-bytes verb 4-bytes parameter count 4-bytes payload length prefix No keyspace name in current versions 2-byte key length key say 10 bytes 4-byte mutation count 1-byte boolean 16-byte cf id 4-byte count of columns Per column 2-byte column name length prefix column name say 8 bytes 1-byte serialization flags 8-byte timestamp 4-byte length prefix column value say 8 bytes Total is 158 bytes. Saving 12 bytes on the CF uuid would be 7.5%. For single-CF mutations this is not a win. Loading data points 16 bytes at a time isn't going to work so hot anyways, so people might look into batching at that point. The UUID is not repeated for each cell, so it is a one-time cost for workloads that modify multiple cells per CF. The one case where the 12 bytes becomes significant is single-cell updates to multiple CFs in one mutation. There the 12-byte overhead converges on 23%. I am going to look at the read path next, but I kind of expect to find something similar. A read is going to have key overhead and possibly overhead for all the other query parameters that should match the simple single-cell mutation case. 
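The two percentages in the breakdown are quick to verify from the field sizes quoted in the comment (the comment rounds the first to 7.5%):

```python
total_min_mutation = 158   # minimum single-CF, single-cell mutation, per the breakdown
cf_id_saving = 16 - 4      # replace the 16-byte cf uuid with a 4-byte int id

# Single-CF case: the saving is a small slice of the whole frame.
single_cf_pct = 100 * cf_id_saving / total_min_mutation   # about 7.6%

# Many single-cell CFs in one mutation: only the per-CF bytes matter,
# so the saving converges on cf_id_saving / per_cf.
per_cf = 1 + 16 + 4 + (2 + 8 + 1 + 8 + 4 + 8)  # boolean + cf id + column count + one column
multi_cf_pct = 100 * cf_id_saving / per_cf                # about 23%
```

This reproduces the comment's figures: the 12-byte uuid-to-int saving is marginal for a single-CF mutation but converges on roughly 23% for mutations touching many CFs with one cell each.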
Remove internal use of Strings for ks/cf names -- Key: CASSANDRA-6060 URL: https://issues.apache.org/jira/browse/CASSANDRA-6060 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Labels: performance Fix For: 3.0 We toss a lot of Strings around internally, including across the network. Once a request has been Prepared, we ought to be able to encode these as int ids. Unfortunately, we moved from int to uuid in CASSANDRA-3794, which was a reasonable move at the time, but a uuid is a lot bigger than an int. Now that we have CAS we can allow concurrent schema updates while still using sequential int IDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8406) Add option to set max_sstable_age in seconds in DTCS
[ https://issues.apache.org/jira/browse/CASSANDRA-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241369#comment-14241369 ] Björn Hegerfors commented on CASSANDRA-8406: I proposed another approach in my last comment on CASSANDRA-8340 (which, by the way, is very tightly coupled to this ticket). The idea is to specify a max window re-merge count instead of a max sstable age. That option would mean, very nearly, "how many times do you want each value to be rewritten?". The good thing about that option in this context is that it scales relatively to window size. If small time windows are used (low baseTime), then a small max_window_exponent will indeed lead to a max SSTable age far lower than a day. Consider min_threshold=4 and base_time_seconds=60. Then max_window_exponent=3 would create all the way up to 64-minute windows, and stop after that. With max_window_exponent=10, the largest windows will be ~2 years (actually ~1.995 years, coincidentally). I can implement this. It would not be difficult. But what do you think? Is this option too confusing? Is it a bad thing that changing base_time_seconds also changes the max SSTable age (linearly)? And that min_threshold does the same (polynomially)? It's just that the number of recompactions is what this is all about anyway. So why not be explicit about it? On a second note, would it make sense to have some other behavior than "no more compactions ever after SSTables get too old"? For instance, how about a flag that makes DTCS create infinitely many same-size windows preceding the max window size? So in my first example, infinite 64-minute windows would be produced. In the event of a repair or out-of-order write, a window many days old may be touched and a compaction would trigger in that window. I'm not suggesting this as a default, but maybe it's useful for something? 
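The figures in the proposal are easy to check: windows grow by a factor of min_threshold at each merge, so after max_window_exponent merges the largest window spans base_time_seconds * min_threshold^max_window_exponent seconds. A small sketch (max_window_exponent is the proposed option, not an existing one; the function name is mine):

```python
def max_window_size(base_seconds, min_threshold, max_window_exponent):
    """Largest window DTCS would create under the proposed option: the
    window grows by a factor of min_threshold per merge, applied
    max_window_exponent times to the base window."""
    return base_seconds * min_threshold ** max_window_exponent

minutes = max_window_size(60, 4, 3) / 60            # 64-minute windows
years = max_window_size(60, 4, 10) / (365 * 86400)  # roughly 1.995 years
```

This reproduces both numbers from the comment: exponent 3 caps windows at 64 minutes, exponent 10 at about two years.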
Add option to set max_sstable_age in seconds in DTCS Key: CASSANDRA-8406 URL: https://issues.apache.org/jira/browse/CASSANDRA-8406 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson Fix For: 2.0.12 Attachments: 0001-patch.patch Using days as the unit for max_sstable_age in DTCS might be too coarse; add an option to set it in seconds. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241388#comment-14241388 ] Ariel Weisberg commented on CASSANDRA-4139: --- I think variable-length integer encoding could be a big space saving in several contexts, but there is an argument against varints. If you want to do zero deserialization/copy, varints will fight you because you can't randomly access fields by offset. What you can do instead is use generic compression. Counter-intuitive, but think of the two use cases: I care about bandwidth, therefore I need compression anyways for non-integer fields; or I don't care about bandwidth, so why not maximize performance. Where this becomes important is in handling large messages where you don't want to parse all of it because you are forwarding or may not consume the entire contents. If you have varints and want to be lazy it gets tricky. I am up for trying it out and measuring. Add varint encoding to Messaging service Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
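The random-access objection is concrete: with varints, a reader must decode a field byte by byte just to learn where the next field starts. A minimal unsigned LEB128-style sketch for illustration (this is not the wire format the ticket's patches define):

```python
def encode_varint(n):
    """7 payload bits per byte; the high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def decode_varint(buf, offset=0):
    """Return (value, next_offset). The caller cannot know next_offset
    without decoding - which is exactly why random access by fixed
    field offset is lost."""
    n = shift = 0
    while True:
        b = buf[offset]
        offset += 1
        n |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return n, offset
```

For example, 300 encodes in two bytes instead of four, but a message forwarder that only wants a later field still has to walk every varint before it.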
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241392#comment-14241392 ] Joshua McKenzie commented on CASSANDRA-8390: Thanks for the info Alexander - I'll try and reproduce locally w/the gist you linked. Also - thanks for the reproduction! Those are *very* helpful in cases like this. The process cannot access the file because it is being used by another process -- Key: CASSANDRA-8390 URL: https://issues.apache.org/jira/browse/CASSANDRA-8390 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin Assignee: Joshua McKenzie Fix For: 2.1.3 21:46:27.810 [NonPeriodicTasks:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[NonPeriodicTasks:1,5,main] org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:135) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:121) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:113) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTableDeletingTask.run(SSTableDeletingTask.java:94) ~[cassandra-all-2.1.1.jar:2.1.1] at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:664) ~[cassandra-all-2.1.1.jar:2.1.1] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71] at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_71] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_71] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] Caused by: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) ~[na:1.7.0_71] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[na:1.7.0_71] at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[na:1.7.0_71] at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) ~[na:1.7.0_71] at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) ~[na:1.7.0_71] at java.nio.file.Files.delete(Files.java:1079) ~[na:1.7.0_71] at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:131) ~[cassandra-all-2.1.1.jar:2.1.1] ... 11 common frames omitted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8454) Convert cql_tests from cassandra-dtest to CQLTester unit tests
Philip Thompson created CASSANDRA-8454: -- Summary: Convert cql_tests from cassandra-dtest to CQLTester unit tests Key: CASSANDRA-8454 URL: https://issues.apache.org/jira/browse/CASSANDRA-8454 Project: Cassandra Issue Type: Test Components: Tests Reporter: Philip Thompson Assignee: Philip Thompson Fix For: 3.0, 2.1.3 See the discussion at [this mail|http://mail-archives.apache.org/mod_mbox/cassandra-dev/201405.mbox/%3ccaaam9sva7vmxj5sbyyb6aorltu6sssg3rifo42+hedafrxx...@mail.gmail.com%3E#archives]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8437) Track digest mismatch ratio
[ https://issues.apache.org/jira/browse/CASSANDRA-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-8437: -- Fix Version/s: 2.1.3 Track digest mismatch ratio --- Key: CASSANDRA-8437 URL: https://issues.apache.org/jira/browse/CASSANDRA-8437 Project: Cassandra Issue Type: Improvement Reporter: Sylvain Lebresne Assignee: Benjamin Lerer Priority: Minor Fix For: 2.1.3 I don't believe we track how often a read results in a digest mismatch, but we should, since that could directly impact read performance in practice. Once we have that data, it might be that some workloads (write-heavy most likely) end up with enough mismatches that going to the data read is more efficient in practice. What we do about it is step 2, however; getting the data is easy enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jonathan lacefield updated CASSANDRA-8447: -- Description: Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC. Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured heap dump from 50 thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=19 -rate threads=different threads tested -schema replication\(factor=3\) keyspace=Keyspace1 -node all nodes listed Data load thread count and results: * 1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node) * 5 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 2k writes per second per node) * 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range * 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node) * 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node) * 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node) Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior. Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads. 
JVM settings tested:
# default, out-of-the-box cassandra-env.sh settings
# 10 G Max | 1 G New - default cassandra-env.sh settings
# 10 G Max | 1 G New - default cassandra-env.sh settings
#* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
# 20 G Max | 10 G New
{noformat}
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
{noformat}
# 20 G Max | 1 G New - same JVM_OPTS list as the 10 G New test above
Linux OS settings tested:
# Disabled Transparent Huge Pages:
{noformat}
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
{noformat}
# Enabled Huge Pages:
{noformat}
echo 215 > /proc/sys/kernel/shmmax     (over 20GB for heap)
echo 1536 > /proc/sys/vm/nr_hugepages  (20GB/2MB page size)
{noformat}
# Disabled NUMA: numa=off in /etc/grub.conf
# Verified all settings documented at http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html were implemented
Attachments:
# cassandra.yaml
# fio output - results.tar.gz
# 50 thread heap dump - https://drive.google.com/a/datastax.com/file/d/0B4Imdpu2YrEbMGpCZW5ta2liQ2c/view?usp=sharing
# 100 thread - visualvm anonymous screenshot - visualvm_screenshot
# dstat screenshot with compaction - Node_with_compaction.png
# dstat screenshot without compaction - Node_without_compaction.png
--
[jira] [Commented] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties
[ https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241435#comment-14241435 ] Philip Thompson commented on CASSANDRA-7873: --- [~slebresne], this did fix the dtests. Thank you.
Replace AbstractRowResolver.replies with collection with tailored properties
Key: CASSANDRA-7873 URL: https://issues.apache.org/jira/browse/CASSANDRA-7873 Project: Cassandra Issue Type: Bug Environment: OSX and Ubuntu 14.04 Reporter: Philip Thompson Assignee: Benedict Labels: qa-resolved Fix For: 3.0 Attachments: 7873.21.txt, 7873.trunk.txt, 7873.txt, 7873_fixup.txt
The dtest auth_test.py:TestAuth.system_auth_ks_is_alterable_test is failing on trunk only with the following stack trace:
{code}
Unexpected error in node1 node log:
ERROR [Thrift:1] 2014-09-03 15:48:08,389 CustomTThreadPoolServer.java:219 - Error occurred during processing of message.
java.util.ConcurrentModificationException: null
	at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:859) ~[na:1.7.0_65]
	at java.util.ArrayList$Itr.next(ArrayList.java:831) ~[na:1.7.0_65]
	at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:71) ~[main/:na]
	at org.apache.cassandra.service.RowDigestResolver.resolve(RowDigestResolver.java:28) ~[main/:na]
	at org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:110) ~[main/:na]
	at org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144) ~[main/:na]
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1228) ~[main/:na]
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1154) ~[main/:na]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:256) ~[main/:na]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:212) ~[main/:na]
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:257) ~[main/:na]
	at org.apache.cassandra.auth.Auth.isExistingUser(Auth.java:76) ~[main/:na]
	at org.apache.cassandra.service.ClientState.login(ClientState.java:178) ~[main/:na]
	at org.apache.cassandra.thrift.CassandraServer.login(CassandraServer.java:1486) ~[main/:na]
	at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3579) ~[thrift/:na]
	at org.apache.cassandra.thrift.Cassandra$Processor$login.getResult(Cassandra.java:3563) ~[thrift/:na]
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) ~[libthrift-0.9.1.jar:0.9.1]
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201) ~[main/:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_65]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_65]
	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_65]
{code}
That exception is thrown when the following query is sent:
{code}
SELECT strategy_options FROM system.schema_keyspaces WHERE keyspace_name = 'system_auth'
{code}
The test alters the RF of the system_auth keyspace, then shuts down and restarts the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
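The failure mode above is the classic one: a plain {{ArrayList}} is iterated while replies are still being appended, and its fail-fast iterator throws {{ConcurrentModificationException}}. The ticket's remedy is a collection whose iterator tolerates concurrent appends. A minimal, self-contained sketch of both behaviors (class and method names here are illustrative, not Cassandra's):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CmeDemo {
    // ArrayList's iterator is fail-fast: a write landing during iteration
    // triggers ConcurrentModificationException on the next next() call.
    static boolean arrayListFailsFast() {
        List<Integer> replies = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer r : replies) {
                replies.add(r + 10); // simulates a reply arriving mid-resolve
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // A concurrent queue's iterator is weakly consistent: concurrent
    // appends never throw, they are simply seen or not seen.
    static boolean concurrentQueueTolerates() {
        Queue<Integer> replies = new ConcurrentLinkedQueue<>(List.of(1, 2, 3));
        int seen = 0;
        for (Integer r : replies) {
            if (seen++ < 3) replies.add(r + 10); // appends during iteration: no CME
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(arrayListFailsFast());       // true
        System.out.println(concurrentQueueTolerates()); // true
    }
}
```

The same weakly-consistent guarantee is what makes an accumulator-style replies collection safe to resolve while late responses are still arriving.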
[jira] [Updated] (CASSANDRA-7873) Replace AbstractRowResolver.replies with collection with tailored properties
[ https://issues.apache.org/jira/browse/CASSANDRA-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Thompson updated CASSANDRA-7873: --- Labels: qa-resolved (was: )
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241452#comment-14241452 ] Jonathan Ellis commented on CASSANDRA-8447: --- It looks like the heap is full of memtable data. Is it trying to flush and not able to keep up? Or is it not recognizing that it needs to flush? /cc [~benedict]
Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core Environment: Cluster size - 4 nodes; Node size - 12 CPU (hyper-threaded to 24 cores), 192 GB RAM, 2 RAID 0 arrays (Data - 10 disk, spinning 10k drives | CL - 2 disk, spinning 10k drives); OS - RHEL 6.5; JVM - Oracle 1.7.0_71; Cassandra version 2.0.11 Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, results.tar.gz, visualvm_screenshot
[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241456#comment-14241456 ] Benedict commented on CASSANDRA-4139: --- We aren't bandwidth-constrained for any workloads I'm aware of, so what are we hoping to achieve here? We already apply compression to the stream, so this will likely only help bandwidth consumption for individual small payloads, where compression cannot be expected to yield much. In such scenarios bandwidth is especially unlikely to be a constraint.
Add varint encoding to Messaging service
Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch
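For context on why the savings concentrate in small payloads: a base-128 varint spends one byte per 7 bits of payload, so only values that fit in a few bytes shrink relative to a fixed-width long. A minimal sketch of the encoding (not the patch's actual implementation):

```java
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Encode an unsigned long as a base-128 varint: 7 payload bits per byte,
    // continuation bit (0x80) set on every byte except the last.
    static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    // Decode: accumulate 7 bits per byte, least-significant group first.
    static long decode(byte[] buf) {
        long v = 0;
        int shift = 0;
        for (byte b : buf) {
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(encode(5).length);    // 1 byte, vs. 8 for a fixed-width long
        System.out.println(encode(300).length);  // 2 bytes
        System.out.println(decode(encode(300))); // 300
    }
}
```

Values that do not fit in 7 bits gain little, and values above 2^56 actually cost more than a fixed-width long, which is consistent with the comment's point that the win is limited to small individual payloads.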
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241467#comment-14241467 ] Jonathan Ellis commented on CASSANDRA-8447: ---
{noformat}
INFO [OptionalTasks:1] 2014-12-03 16:33:22,382 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') (estimated 180613400 bytes)
INFO [OptionalTasks:1] 2014-12-03 16:33:22,383 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@1920408967(18066400/180664000 serialized/live bytes, 410600 ops)
{noformat}
Looks like it's using a liveRatio of 1.0, which is almost certainly broken. Need to enable debug logging on Memtable.
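For reference, liveRatio in 2.0-era memtables is the multiplier used to estimate live heap bytes from serialized bytes, and the quoted flush log line carries both figures, so the ratio it implies can be checked directly (a trivial sketch; the class name is illustrative):

```java
public class LiveRatioDemo {
    // Figures from the quoted log line:
    // Memtable-Standard1@...(18066400/180664000 serialized/live bytes, 410600 ops)
    static final long SERIALIZED = 18_066_400L;
    static final long LIVE       = 180_664_000L;

    // liveRatio = estimated live (heap) bytes per serialized byte.
    static double liveRatio() {
        return (double) LIVE / SERIALIZED;
    }

    public static void main(String[] args) {
        System.out.println(liveRatio()); // 10.0
    }
}
```

Enabling debug logging on Memtable, as suggested, would show the liveRatio value the node actually computed rather than one inferred from a single flush line.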
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241472#comment-14241472 ] jonathan lacefield commented on CASSANDRA-8447: ---
1) Flushing operations appear to be fine: no backed-up flush writers, and disk I/O looks acceptable as well.
{noformat}
FlushWriter 0 0 2722 0 0
{noformat}
2) Will enable debug logging on Memtable and update the ticket.
[jira] [Commented] (CASSANDRA-8316) Did not get positive replies from all endpoints error on incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14241502#comment-14241502 ] Alan Boudreault commented on CASSANDRA-8316: --- Just pointing it out in case this ticket is related: CASSANDRA-8291
Did not get positive replies from all endpoints error on incremental repair
Key: CASSANDRA-8316 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra 2.1.2 Reporter: Loic Lambiel Assignee: Marcus Eriksson Fix For: 2.1.3 Attachments: 0001-patch.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, test.sh
Hi, I've got an issue with incremental repairs on our production 15-node 2.1.2 cluster (new cluster, not yet loaded, RF=3). After having successfully performed an incremental repair (-par -inc) on 3 nodes, I started receiving "Repair failed with error Did not get positive replies from all endpoints." from nodetool on all remaining nodes:
{noformat}
[2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace (seq=false, full=false)
[2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
{noformat}
All the nodes are up and running, and the local system log shows that the repair commands got started, and that's it. I've also noticed that soon after the repair, several nodes started showing elevated CPU load indefinitely without any particular reason (no tasks / queries, nothing in the logs). I then restarted C* on these nodes and retried the repair on several nodes, which were successful until facing the issue again. I tried to repro on our 3-node preproduction cluster without success. It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html Any idea? Thanks Loic
[jira] [Updated] (CASSANDRA-8248) Possible memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie updated CASSANDRA-8248: --- Attachment: 8248_v1.txt Attaching a patch that releases references to the SSTableReader during SSTableWriter.openEarly if there's trouble. See discussion on CASSANDRA-8061 - this came up during inspection and could lead to leaks.
Possible memory leak
Key: CASSANDRA-8248 URL: https://issues.apache.org/jira/browse/CASSANDRA-8248 Project: Cassandra Issue Type: Bug Reporter: Alexander Sterligov Assignee: Shawn Kumar Attachments: 8248_v1.txt, thread_dump
Sometimes during repair Cassandra starts to consume more memory than expected. The total amount of data on the node is about 20GB. The size of the data directory is 66GB because of snapshots. Top reports:
{noformat}
PID   USER     PR NI VIRT RES SHR S %CPU %MEM TIME+   COMMAND
15724 loadbase 20 0  493g 55g 44g S 28   44.2 4043:24 java
{noformat}
In /proc/15724/maps there are a lot of deleted file maps:
{quote}
7f63a6102000-7f63a6332000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6332000-7f63a6562000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6562000-7f63a6792000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6792000-7f63a69c2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a69c2000-7f63a6bf2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6bf2000-7f63a6e22000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6e22000-7f63a7052000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7052000-7f63a7282000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7282000-7f63a74b2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a74b2000-7f63a76e2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a76e2000-7f63a7912000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7912000-7f63a7b42000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7b42000-7f63a7d72000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7d72000-7f63a7fa2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7fa2000-7f63a81d2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a81d2000-7f63a8402000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a8402000-7f63a8622000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a8622000-7f63a8842000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
{quote}
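The "(deleted)" entries follow from how mmap interacts with unlink: a mapping created from a file stays valid, and keeps the underlying disk space alive, after the file is deleted, until the mapping itself goes away (for Java, when the buffer is garbage-collected). A minimal sketch of that lifecycle (illustrative, not Cassandra code; assumes a POSIX filesystem, where a mapped file can be unlinked):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDeletedDemo {
    // Map a one-byte file, unlink it, then read through the still-live mapping.
    // On Linux, /proc/<pid>/maps would now list the path suffixed "(deleted)",
    // exactly like the entries above. Returns -1 on platforms where a mapped
    // file cannot be deleted (e.g. Windows).
    static int demo() {
        try {
            Path f = Files.createTempFile("tmplink-demo-Index", ".db");
            Files.write(f, new byte[] { 42 });
            MappedByteBuffer map;
            try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
                map = ch.map(FileChannel.MapMode.READ_ONLY, 0, 1);
            }
            Files.delete(f);   // unlink: directory entry gone, mapping still holds the pages
            return map.get(0); // readable until the buffer is garbage-collected
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This is why leaked SSTableReader references matter: each one pins mappings (and their disk space) for files that compaction has already deleted.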
[jira] [Reopened] (CASSANDRA-8248) Possible memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie reopened CASSANDRA-8248: Assignee: Joshua McKenzie (was: Shawn Kumar)
Possible memory leak - Key: CASSANDRA-8248 URL: https://issues.apache.org/jira/browse/CASSANDRA-8248 Project: Cassandra Issue Type: Bug Reporter: Alexander Sterligov Assignee: Joshua McKenzie Attachments: 8248_v1.txt, thread_dump
Sometimes during repair Cassandra starts to consume more memory than expected. The total amount of data on the node is about 20GB. The size of the data directory is 66GB because of snapshots. Top reports:
{noformat}
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15724 loadbase 20 0 493g 55g 44g S 28 44.2 4043:24 java
{noformat}
In /proc/15724/maps there are a lot of deleted file mappings:
{quote}
7f63a6102000-7f63a6332000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6332000-7f63a6562000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6562000-7f63a6792000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6792000-7f63a69c2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a69c2000-7f63a6bf2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6bf2000-7f63a6e22000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a6e22000-7f63a7052000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7052000-7f63a7282000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7282000-7f63a74b2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a74b2000-7f63a76e2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a76e2000-7f63a7912000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7912000-7f63a7b42000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7b42000-7f63a7d72000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7d72000-7f63a7fa2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a7fa2000-7f63a81d2000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a81d2000-7f63a8402000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a8402000-7f63a8622000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a8622000-7f63a8842000 r--s 08:21 9442763 /ssd/cassandra/data/iss/feedback_history-d32bc7e048c011e49b989bc3e8a5a440/iss-feedback_history-tmplink-ka-328671-Index.db (deleted)
7f63a8842000-7f63a8a62000 r--s 08:21 9442763
[jira] [Updated] (CASSANDRA-8248) Possible memory leak
[ https://issues.apache.org/jira/browse/CASSANDRA-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie updated CASSANDRA-8248: --- Reviewer: Benedict
Possible memory leak - Key: CASSANDRA-8248 URL: https://issues.apache.org/jira/browse/CASSANDRA-8248
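A quick way to quantify the kind of leak shown in the maps excerpt is to count how many times each deleted file is still mapped. This is a diagnostic sketch (not from the ticket); the parsing assumes the standard /proc/&lt;pid&gt;/maps line format with the path as the last field:

```python
import re
from collections import Counter

def deleted_mappings(maps_text: str) -> Counter:
    """Count '(deleted)' mappings per path in /proc/<pid>/maps output."""
    counts = Counter()
    for raw in maps_text.splitlines():
        line = raw.rstrip()
        if line.endswith("(deleted)"):
            # The path is the last whitespace-free token before " (deleted)".
            m = re.search(r"(/\S+) \(deleted\)$", line)
            if m:
                counts[m.group(1)] += 1
    return counts

# Tiny synthetic sample in the same shape as the report's excerpt:
sample = (
    "7f63a6102000-7f63a6332000 r--s 00000000 08:21 9442763 /ssd/x-Index.db (deleted)\n"
    "7f63a6332000-7f63a6562000 r--s 00000000 08:21 9442763 /ssd/x-Index.db (deleted)\n"
    "7f63b0000000-7f63b0021000 rw-p 00000000 00:00 0\n"
)
assert deleted_mappings(sample)["/ssd/x-Index.db"] == 2
```

Against a live process it would be run as `deleted_mappings(open(f"/proc/{pid}/maps").read())`; thousands of mappings of the same tmplink Index.db, as in the report, point at references to obsolete sstables never being released.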
[jira] [Commented] (CASSANDRA-8430) Updating a row that has a TTL produce unexpected results
[ https://issues.apache.org/jira/browse/CASSANDRA-8430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241517#comment-14241517 ] Andrew Garrett commented on CASSANDRA-8430: --- [~slebresne] Thanks for the insight. I'm using DataStax DevCenter for this, which to my understanding is a GUI around their Java driver, so that would explain the 0-vs-null behavior. And that also explains the insert vs. update behavior. Thanks, I'm satisfied.
Updating a row that has a TTL produce unexpected results Key: CASSANDRA-8430 URL: https://issues.apache.org/jira/browse/CASSANDRA-8430 Project: Cassandra Issue Type: Bug Reporter: Alan Boudreault Labels: cassandra, ttl Fix For: 2.0.11, 2.1.2, 3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
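The insert-vs-update distinction Sylvain describes can be illustrated with a toy model: an INSERT writes a row marker carrying the statement's TTL, while an UPDATE writes only the cells it names and never touches the marker. This is a simplified sketch of the semantics, not Cassandra's actual storage code:

```python
import math

class Row:
    """Toy model of a CQL row: a row marker plus per-cell expiry times."""
    def __init__(self):
        self.marker_expiry = None   # None = no INSERT ever ran
        self.cells = {}             # column name -> expiry time (math.inf = no TTL)

    def insert(self, now, cols, ttl=None):
        exp = now + ttl if ttl else math.inf
        self.marker_expiry = exp    # INSERT writes the row marker...
        for c in cols:              # ...and each named cell, with the same TTL.
            self.cells[c] = exp

    def update(self, now, cols, ttl=None):
        exp = now + ttl if ttl else math.inf
        for c in cols:              # UPDATE writes only the named cells.
            self.cells[c] = exp

    def live_cells(self, now):
        return {c for c, e in self.cells.items() if e > now}

    def exists(self, now):
        marker_live = self.marker_expiry is not None and self.marker_expiry > now
        return marker_live or bool(self.live_cells(now))

# Reproduce the reporter's first scenario:
r = Row()
r.insert(0, ["foo", "bar"], ttl=5)   # INSERT ... USING TTL 5
r.update(1, ["bar"])                 # UPDATE with no TTL
# At t=7 the marker and 'foo' have expired, but the un-TTL'd 'bar'
# keeps the row alive -- exactly the "1 | change | null" result.
assert r.exists(7) and r.live_cells(7) == {"bar"}
```

In the second scenario (UPDATE ... USING TTL 10), the same model shows the row surviving past t=6 on the strength of `bar` alone and disappearing once that second TTL expires, matching the script's final `(0 rows)`.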
[jira] [Updated] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jonathan lacefield updated CASSANDRA-8447: -- Attachment: memtable_debug memtable information from system.log
Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core
Environment: Cluster size - 4 nodes; Node size - 12 CPU (hyper-threaded to 24 cores), 192 GB RAM, 2 RAID 0 arrays (Data - 10 disk, spinning 10k drives | CL - 2 disk, spinning 10k drives); OS - RHEL 6.5; JVM - Oracle 1.7.0_71; Cassandra version 2.0.11
Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, results.tar.gz, visualvm_screenshot
Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC.
Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write-only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured a heap dump from the 50-thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up.
Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=19 -rate threads=different threads tested -schema replication\(factor=3\) keyspace=Keyspace1 -node all nodes listed
Data load thread count and results:
* 1 thread - Still running, but it looks like the node can sustain this load (approx 500 writes per second per node)
* 5 threads - Nodes become unresponsive due to full Old Gen heap. CMS measured in the 60-second range (approx 2k writes per second per node)
* 10 threads - Nodes become unresponsive due to full Old Gen heap. CMS measured in the 60-second range
* 50 threads - Nodes become unresponsive due to full Old Gen heap. CMS measured in the 60-second range (approx 10k writes per second per node)
* 100 threads - Nodes become unresponsive due to full Old Gen heap. CMS measured in the 60-second range (approx 20k writes per second per node)
* 200 threads - Nodes become unresponsive due to full Old Gen heap. CMS measured in the 60-second range (approx 25k writes per second per node)
Note - the observed behavior was the same for all tests except the single-threaded test, which does not appear to show this behavior.
Tested different GC and Linux OS settings with a focus on the 50- and 200-thread loads.
JVM settings tested:
# default, out of the box, env-sh settings
# 10 G Max | 1 G New - default env-sh settings
# 10 G Max | 1 G New - default env-sh settings
#* JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50
# 20 G Max | 10 G New
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768
JVM_OPTS=$JVM_OPTS -XX:-UseBiasedLocking
# 20 G Max | 1 G New
JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=8
JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseTLAB
JVM_OPTS=$JVM_OPTS -XX:+CMSScavengeBeforeRemark
JVM_OPTS=$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6
JVM_OPTS=$JVM_OPTS -XX:CMSWaitDuration=3
JVM_OPTS=$JVM_OPTS -XX:ParallelGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:ConcGCThreads=12
JVM_OPTS=$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
JVM_OPTS=$JVM_OPTS -XX:+UseGCTaskAffinity
JVM_OPTS=$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs
JVM_OPTS=$JVM_OPTS
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241583#comment-14241583 ] jonathan lacefield commented on CASSANDRA-8447: --- Attached memtable-specific log information via cassandra.db debug. Here's an excerpt of the MemoryMeter ratio:
DEBUG [MemoryMeter:1] 2014-12-10 10:58:47,932 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.85183949414442 (just-counted was 7.703678988288838). calculation took 9828ms for 518760 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:58:54,991 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.993666918209787 (just-counted was 7.987333836419574). calculation took 6766ms for 344480 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:59:04,165 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.820729561952485 (just-counted was 7.641459123904968). calculation took 8765ms for 501265 cells
Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241608#comment-14241608 ] jonathan lacefield commented on CASSANDRA-8447: --- Collecting MemoryMeter debug statements. Durations look a bit long -- this is during a healthy period:
DEBUG [MemoryMeter:1] 2014-12-10 10:41:41,355 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.924682938349301 (just-counted was 7.849365876698601). calculation took 8306ms for 421490 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:42,763 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 5.99974107609572 (just-counted was 1.9994821521914399). calculation took 1170ms for 79935 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:53,384 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 7.777244370394949 (just-counted was 7.777244370394949). calculation took 10491ms for 566260 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:56,842 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 6.347211243843944 (just-counted was 2.6944224876878886). calculation took 3119ms for 195905 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:42:06,136 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.878754207137401 (just-counted was 7.757508414274801). calculation took 9022ms for 507230 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:42:11,883 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 6.877887628540337 (just-counted was 3.7557752570806753). calculation took 5076ms for 270195 cells
Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
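For readers puzzling over the liveRatio lines: the logged values are consistent with an update rule in which a higher just-counted estimate is believed immediately (guessing low risks OOM) while a lower one is averaged with the previous value, starting from 10.0 for a fresh memtable. A sketch, assuming that rule:

```python
def update_live_ratio(previous: float, just_counted: float) -> float:
    # Higher estimates are believed immediately; lower ones are
    # averaged with the old value (conservative: underestimating
    # memtable overhead risks OOM).
    if just_counted > previous:
        return just_counted
    return (previous + just_counted) / 2.0

# The excerpted log lines match averaging against an initial 10.0:
assert abs(update_live_ratio(10.0, 7.703678988288838) - 8.85183949414442) < 1e-9
assert abs(update_live_ratio(10.0, 7.987333836419574) - 8.993666918209787) < 1e-9
assert abs(update_live_ratio(10.0, 7.641459123904968) - 8.820729561952485) < 1e-9
```

So the reported ratios lag the just-counted measurements by design; the notable symptom here is less the ratio itself than the multi-second calculation durations for each sample.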
[jira] [Issue Comment Deleted] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jonathan lacefield updated CASSANDRA-8447: -- Comment: was deleted (was: Collecting MemoryMeter debug statements. Durations look a bit long -- this is during a healthy period:
DEBUG [MemoryMeter:1] 2014-12-10 10:41:41,355 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.924682938349301 (just-counted was 7.849365876698601). calculation took 8306ms for 421490 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:42,763 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 5.99974107609572 (just-counted was 1.9994821521914399). calculation took 1170ms for 79935 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:53,384 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 7.777244370394949 (just-counted was 7.777244370394949). calculation took 10491ms for 566260 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:41:56,842 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 6.347211243843944 (just-counted was 2.6944224876878886). calculation took 3119ms for 195905 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:42:06,136 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 8.878754207137401 (just-counted was 7.757508414274801). calculation took 9022ms for 507230 cells
DEBUG [MemoryMeter:1] 2014-12-10 10:42:11,883 Memtable.java (line 473) CFS(Keyspace='Keyspace1', ColumnFamily='Standard1') liveRatio is 6.877887628540337 (just-counted was 3.7557752570806753). calculation took 5076ms for 270195 cells )
Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447
[jira] [Created] (CASSANDRA-8455) IndexOutOfBoundsException when building SyntaxError message snippet
Tyler Hobbs created CASSANDRA-8455: -- Summary: IndexOutOfBoundsException when building SyntaxError message snippet Key: CASSANDRA-8455 URL: https://issues.apache.org/jira/browse/CASSANDRA-8455 Project: Cassandra Issue Type: Bug Components: Core Reporter: Tyler Hobbs Assignee: Benjamin Lerer Priority: Minor Fix For: 2.1.3 It looks like some syntax errors can result in an IndexOutOfBoundsException when the error message snippet is being built:
{noformat}
cqlsh> create table foo (a int primary key, b int;
<ErrorMessage code=2000 [Syntax error in CQL query] message="Failed parsing statement: [create table foo (a int primary key, b int;] reason: ArrayIndexOutOfBoundsException -1">
{noformat}
There isn't any error or stacktrace in the server logs. It would be good to fix that as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
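The `ArrayIndexOutOfBoundsException -1` suggests the snippet builder indexes the query string with an error offset that can be negative. A defensive shape for such a helper (hypothetical names, not the actual ANTLR error-handling code) is to clamp the context window to the string's bounds:

```python
def error_snippet(query: str, pos: int, context: int = 20) -> str:
    """Return up to `context` chars either side of `pos`, never raising
    on out-of-range positions (e.g. pos == -1 from the parser)."""
    start = max(0, pos - context)
    end = min(len(query), pos + context)
    if start >= end:
        # pos was entirely out of range; fall back to the query's tail.
        return query[-2 * context:] if query else ""
    return query[start:end]

stmt = "create table foo (a int primary key, b int;"
assert error_snippet(stmt, -1) != ""   # no exception even for pos = -1
assert error_snippet("", -1) == ""     # empty input is also safe
```

Clamping at the boundary keeps the snippet best-effort while guaranteeing the error path itself can never throw, which is the property the ticket is asking for.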
[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241644#comment-14241644 ] Ariel Weisberg commented on CASSANDRA-4139: --- Is bandwidth a constraint for WAN replication? In practice is the default for messaging to have compression on? What are people doing in the wild? I could imagine varint encoding being a win for Cells where the names and values are integers and queries are bulk loading or selecting ranges. At the storage level it seems like the kind of thing that could beat general-purpose compression if you know what data type you are dealing with and have a lot of 0-padded values. I have heard talk about using a column store and run length encoding approach for storage, which makes it seem like varint encoding wouldn't be the tool of choice for storage either. The code changes don't look bad. It's mostly swapping types for streams and changes to calculating serialized size so that it is aware of the impact of variable length encoded integers. It could save bandwidth, but it could also be slower since you spend more cycles calculating serialized size and encoding/decoding integers. If you end up using compression in bandwidth-sensitive scenarios you may not win much. Not varint encoding the data going in/out of the database means you only save real space proportionally when you have small operations going in/out. The flip side is that you can't do that many small ops anyways, so you aren't bandwidth constrained. 
Add varint encoding to Messaging service Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241644#comment-14241644 ] Ariel Weisberg edited comment on CASSANDRA-4139 at 12/10/14 8:19 PM: - Is bandwidth a constraint for WAN replication? In practice is the default for messaging to have compression on? What are people doing in the wild? I could imagine varint encoding being a win for Cells where the names and values are integers and queries are bulk loading or selecting ranges. At the storage level it seems like the kind of thing that could beat general purpose compression if you know what data type you are dealing with and have a lot of 0 padded values. I have heard talk about using a column store and run length encoding approach for storage which makes it seem like varint encoding wouldn't be the tool of choice for storage either. The code changes don't look bad. It's mostly swapping types for streams and changes to calculating serialized size so that it is aware of the impact of variable length encoded integers. It could save bandwidth, but it could also be slower since you spend more cycles calculating serialized size and encoding/decoding integers. If you end up using compression in bandwidth sensitive scenarios you may not win much. Not varint encoding the data going in/out of the database means you only save real space proportionally when you have small operations going in/out. The flip side is that you can't do that many small ops anyways so you aren't bandwidth constrained. Add varint encoding to Messaging service Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
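The trade-off described above (fewer bytes on the wire versus extra cycles spent encoding, decoding, and computing serialized size) can be seen in a minimal sketch of the general technique. This is an illustrative LEB128-style varint, not the actual vint format from the patch:

```java
import java.io.ByteArrayOutputStream;

// Illustrative varint codec: 7 payload bits per byte, high bit = "more follows".
// Small values shrink from 8 fixed bytes to 1-2 bytes, at the cost of a loop
// per field both when sizing/encoding and when decoding.
public final class VarIntSketch {
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // low 7 bits + continuation flag
            value >>>= 7;
        }
        out.write((int) value);                        // final byte, high bit clear
        return out.toByteArray();
    }

    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;       // reassemble 7 bits at a time
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) {
        // 300 fits in 2 varint bytes instead of a fixed 8-byte long.
        assert VarIntSketch.encode(300L).length == 2;
        assert VarIntSketch.decode(VarIntSketch.encode(300L)) == 300L;
    }
}
```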
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241665#comment-14241665 ] Joshua McKenzie commented on CASSANDRA-8390: [~alexander_radzin]: how many runs does it take with the attached cqlSync test to reproduce? Also - are you running 2.1.1 or 2.1.2? I'm thus far unable to reproduce on either win7 or win8.1 with cqlSync, with or without memory-mapped index files. Also - is there an antivirus client installed in this test environment? We've seen issues w/file access violations on Windows in the past due to that as well. The process cannot access the file because it is being used by another process -- Key: CASSANDRA-8390 URL: https://issues.apache.org/jira/browse/CASSANDRA-8390 Project: Cassandra Issue Type: Bug Reporter: Ilya Komolkin Assignee: Joshua McKenzie Fix For: 2.1.3 21:46:27.810 [NonPeriodicTasks:1] ERROR o.a.c.service.CassandraDaemon - Exception in thread Thread[NonPeriodicTasks:1,5,main] org.apache.cassandra.io.FSWriteError: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process. 
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:135) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:121) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTable.delete(SSTable.java:113) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTableDeletingTask.run(SSTableDeletingTask.java:94) ~[cassandra-all-2.1.1.jar:2.1.1]
at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:664) ~[cassandra-all-2.1.1.jar:2.1.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_71]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_71]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_71]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
Caused by: java.nio.file.FileSystemException: E:\Upsource_12391\data\cassandra\data\kernel\filechangehistory_t-a277b560764611e48c8e4915424c75fe\kernel-filechangehistory_t-ka-33-Index.db: The process cannot access the file because it is being used by another process.
at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86) ~[na:1.7.0_71]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97) ~[na:1.7.0_71]
at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102) ~[na:1.7.0_71]
at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269) ~[na:1.7.0_71]
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) ~[na:1.7.0_71]
at java.nio.file.Files.delete(Files.java:1079) ~[na:1.7.0_71]
at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:131) ~[cassandra-all-2.1.1.jar:2.1.1]
... 11 common frames omitted
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-7061) High accuracy, low overhead local read/write tracing
[ https://issues.apache.org/jira/browse/CASSANDRA-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ariel Weisberg reassigned CASSANDRA-7061: - Assignee: Ariel Weisberg High accuracy, low overhead local read/write tracing Key: CASSANDRA-7061 URL: https://issues.apache.org/jira/browse/CASSANDRA-7061 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Benedict Assignee: Ariel Weisberg Fix For: 3.0 External profilers are pretty inadequate for getting accurate information at the granularity we're working at: tracing is too high overhead, so measures something completely different, and sampling suffers from bias of attribution due to the way the stack traces are retrieved. Hyperthreading can make this even worse. I propose to introduce an extremely low overhead tracing feature that must be enabled with a system property that will trace operations within the node only, so that we can perform various accurate low level analyses of performance. This information will include threading info, so that we can trace hand off delays and actual active time spent processing an operation. With the property disabled there will be no increased burden of tracing, however I hope to keep the total trace burden to less than one microsecond, and any single trace command to a few tens of nanos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
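One way the proposed design could look is a trace call that is a no-op unless enabled by a system property, recording (event id, thread id, nanoTime) triples into a preallocated ring buffer so a single call costs only a few array writes. This is an assumed sketch; the class name, property name, and record layout below are hypothetical, not from the ticket:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical low-overhead local tracer: disabled by default, enabled via
// -Dcassandra.local_trace=true; when disabled, trace() returns immediately.
public final class LocalTrace {
    private static final boolean ENABLED = Boolean.getBoolean("cassandra.local_trace");
    private static final int CAPACITY = 1 << 16;            // power of two for cheap masking
    private static final long[] events = new long[CAPACITY * 3];
    private static final AtomicInteger next = new AtomicInteger();

    public static void trace(int eventId) {
        if (!ENABLED)
            return;                                          // no tracing burden when disabled
        int slot = (next.getAndIncrement() & (CAPACITY - 1)) * 3;
        events[slot]     = eventId;                          // what happened
        events[slot + 1] = Thread.currentThread().getId();   // hand-off / threading info
        events[slot + 2] = System.nanoTime();                // when, with ns resolution
    }

    static int recorded() { return next.get(); }

    public static void main(String[] args) {
        trace(1);
        System.out.println("tracing enabled: " + ENABLED + ", recorded: " + recorded());
    }
}
```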
[jira] [Created] (CASSANDRA-8456) Some valid index queries can be considered as invalid
Benjamin Lerer created CASSANDRA-8456: - Summary: Some valid index queries can be considered as invalid Key: CASSANDRA-8456 URL: https://issues.apache.org/jira/browse/CASSANDRA-8456 Project: Cassandra Issue Type: Bug Reporter: Benjamin Lerer Assignee: Benjamin Lerer Some secondary index queries are rejected or require ALLOW FILTERING when they should not. It seems that in certain cases {{SelectStatement}} uses index filtering for clustering column restrictions when it should be using clustering column slices. The following unit tests can be used to reproduce the problem in 3.0:
{code}
@Test
public void testMultipleClusteringWithIndex() throws Throwable
{
    createTable("CREATE TABLE %s (a int, b int, c int, d int, e int, PRIMARY KEY (a, b, c, d))");
    createIndex("CREATE INDEX ON %s (b)");
    createIndex("CREATE INDEX ON %s (e)");
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 0, 0, 0, 0);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 0, 1, 0, 1);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 0, 1, 1, 2);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 1, 0, 0, 0);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 1, 1, 0, 1);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 1, 1, 1, 2);
    execute("INSERT INTO %s (a, b, c, d, e) VALUES (?, ?, ?, ?, ?)", 0, 2, 0, 0, 0);
    assertRows(execute("SELECT * FROM %s WHERE (b, c) = (?, ?)", 1, 1),
               row(0, 1, 1, 0, 1),
               row(0, 1, 1, 1, 2));
}

@Test
public void testMultiplePartitionKeyAndMultiClusteringWithIndex() throws Throwable
{
    createTable("CREATE TABLE %s (a int, b int, c int, d int, e int, f int, PRIMARY KEY ((a, b), c, d, e))");
    createIndex("CREATE INDEX ON %s (c)");
    createIndex("CREATE INDEX ON %s (f)");
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 0, 0, 0, 0);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 0, 1, 0, 1);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 0, 1, 1, 2);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 1, 0, 0, 3);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 1, 1, 0, 4);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 1, 1, 1, 5);
    execute("INSERT INTO %s (a, b, c, d, e, f) VALUES (?, ?, ?, ?, ?, ?)", 0, 0, 2, 0, 0, 6);
    assertRows(execute("SELECT * FROM %s WHERE a = ? AND (c) IN ((?), (?)) AND f = ?", 0, 1, 2, 5),
               row(0, 0, 1, 1, 1, 5));
    assertRows(execute("SELECT * FROM %s WHERE a = ? AND (c, d) IN ((?, ?)) AND f = ?", 0, 1, 1, 5),
               row(0, 0, 1, 1, 1, 5));
    assertRows(execute("SELECT * FROM %s WHERE a = ? AND (c) = (?) AND f = ?", 0, 1, 5),
               row(0, 0, 1, 1, 1, 5));
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8449) Allow zero-copy reads again
[ https://issues.apache.org/jira/browse/CASSANDRA-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241737#comment-14241737 ] Jonathan Ellis commented on CASSANDRA-8449: --- Right, even after 7392 a timeout approach is dangerous. bq. Typically I would not want to use this approach for guarding operations that could take arbitrarily long, but really all we're sacrificing is virtual address space Can you spell that out for me? Isn't the existing use of OpOrder technically arbitrarily long due to GC for instance? Allow zero-copy reads again --- Key: CASSANDRA-8449 URL: https://issues.apache.org/jira/browse/CASSANDRA-8449 Project: Cassandra Issue Type: Improvement Reporter: T Jake Luciani Assignee: T Jake Luciani Priority: Minor Labels: performance Fix For: 3.0 We disabled zero-copy reads in CASSANDRA-3179 due to in flight reads accessing a ByteBuffer when the data was unmapped by compaction. Currently this code path is only used for uncompressed reads. The actual bytes are in fact copied to the client output buffers for both netty and thrift before being sent over the wire, so the only issue really is the time it takes to process the read internally. This patch adds a slow network read test and changes the tidy() method to actually delete a sstable once the readTimeout has elapsed giving plenty of time to serialize the read. Removing this copy causes significantly less GC on the read path and improves the tail latencies: http://cstar.datastax.com/graph?stats=c0c8ce16-7fea-11e4-959d-42010af0688fmetric=gc_countoperation=2_readsmoothing=1show_aggregates=truexmin=0xmax=109.34ymin=0ymax=5.5 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
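The tidy() idea described above (delete an obsolete sstable only after the read timeout has elapsed, so in-flight zero-copy reads have time to finish serializing before the mapping disappears) can be sketched as a deferred deletion. The names here are hypothetical; the real patch works on SSTableReader internals:

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical deferred-deletion helper: rather than deleting an obsolete
// sstable's files immediately, schedule the delete readTimeout later, trading
// virtual address space (the mapping lives a bit longer) for read safety.
public final class DelayedTidy {
    private static final ScheduledExecutorService CLEANER =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "sstable-cleaner");
                t.setDaemon(true);          // don't keep the JVM alive just for cleanup
                return t;
            });

    static ScheduledFuture<?> tidy(File sstableFile, long readTimeoutMillis) {
        // Any read started before this point has readTimeoutMillis to complete.
        return CLEANER.schedule(() -> sstableFile.delete(),
                                readTimeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```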
[4/6] cassandra git commit: merge from 2.0
merge from 2.0 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/27c67ad8 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/27c67ad8 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/27c67ad8 Branch: refs/heads/trunk Commit: 27c67ad851651cb49c9d1cae7d478b831e372aaf Parents: 29259cb 5784309 Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:48 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:48 2014 -0600
CHANGES.txt | 1 +
.../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
http://git-wip-us.apache.org/repos/asf/cassandra/blob/27c67ad8/CHANGES.txt
diff --cc CHANGES.txt
index 2e74a15,385af01..25e0f47
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,24 -1,5 +1,25 @@@
-2.0.12:
+2.1.3
+ * Remove tmplink files for offline compactions (CASSANDRA-8321)
+ * Reduce maxHintsInProgress (CASSANDRA-8415)
+ * BTree updates may call provided update function twice (CASSANDRA-8018)
+ * Release sstable references after anticompaction (CASSANDRA-8386)
+ * Handle abort() in SSTableRewriter properly (CASSANDRA-8320)
+ * Fix high size calculations for prepared statements (CASSANDRA-8231)
+ * Centralize shared executors (CASSANDRA-8055)
+ * Fix filtering for CONTAINS (KEY) relations on frozen collection
+   clustering columns when the query is restricted to a single
+   partition (CASSANDRA-8203)
+ * Do more aggressive entire-sstable TTL expiry checks (CASSANDRA-8243)
+ * Add more log info if readMeter is null (CASSANDRA-8238)
+ * add check of the system wall clock time at startup (CASSANDRA-8305)
+ * Support for frozen collections (CASSANDRA-7859)
+ * Fix overflow on histogram computation (CASSANDRA-8028)
+ * Have paxos reuse the timestamp generation of normal queries (CASSANDRA-7801)
+ * Fix incremental repair not remove parent session on remote (CASSANDRA-8291)
+ * Improve JBOD disk utilization (CASSANDRA-7386)
+ * Log failed host when preparing incremental repair (CASSANDRA-8228)
+Merged from 2.0:
+ * Default DTCS base_time_seconds changed to 60 (CASSANDRA-8417)
 * Refuse Paxos operation with more than one pending endpoint (CASSANDRA-8346)
 * Throw correct exception when trying to bind a keyspace or table name (CASSANDRA-6952)
[5/6] cassandra git commit: merge from 2.0
merge from 2.0 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/27c67ad8 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/27c67ad8 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/27c67ad8 Branch: refs/heads/cassandra-2.1 Commit: 27c67ad851651cb49c9d1cae7d478b831e372aaf Parents: 29259cb 5784309 Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:48 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:48 2014 -0600
CHANGES.txt | 1 +
.../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
http://git-wip-us.apache.org/repos/asf/cassandra/blob/27c67ad8/CHANGES.txt
diff --cc CHANGES.txt
index 2e74a15,385af01..25e0f47
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@@ -1,24 -1,5 +1,25 @@@
-2.0.12:
+2.1.3
+ * Remove tmplink files for offline compactions (CASSANDRA-8321)
+ * Reduce maxHintsInProgress (CASSANDRA-8415)
+ * BTree updates may call provided update function twice (CASSANDRA-8018)
+ * Release sstable references after anticompaction (CASSANDRA-8386)
+ * Handle abort() in SSTableRewriter properly (CASSANDRA-8320)
+ * Fix high size calculations for prepared statements (CASSANDRA-8231)
+ * Centralize shared executors (CASSANDRA-8055)
+ * Fix filtering for CONTAINS (KEY) relations on frozen collection
+   clustering columns when the query is restricted to a single
+   partition (CASSANDRA-8203)
+ * Do more aggressive entire-sstable TTL expiry checks (CASSANDRA-8243)
+ * Add more log info if readMeter is null (CASSANDRA-8238)
+ * add check of the system wall clock time at startup (CASSANDRA-8305)
+ * Support for frozen collections (CASSANDRA-7859)
+ * Fix overflow on histogram computation (CASSANDRA-8028)
+ * Have paxos reuse the timestamp generation of normal queries (CASSANDRA-7801)
+ * Fix incremental repair not remove parent session on remote (CASSANDRA-8291)
+ * Improve JBOD disk utilization (CASSANDRA-7386)
+ * Log failed host when preparing incremental repair (CASSANDRA-8228)
+Merged from 2.0:
+ * Default DTCS base_time_seconds changed to 60 (CASSANDRA-8417)
 * Refuse Paxos operation with more than one pending endpoint (CASSANDRA-8346)
 * Throw correct exception when trying to bind a keyspace or table name (CASSANDRA-6952)
[1/6] cassandra git commit: Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417
Repository: cassandra Updated Branches: refs/heads/cassandra-2.0 77df5578a - 578430952 refs/heads/cassandra-2.1 29259cb22 - 27c67ad85 refs/heads/trunk c64ac4188 - 6ce8b3fcb Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/57843095 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/57843095 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/57843095 Branch: refs/heads/cassandra-2.0 Commit: 578430952789bbc2dc7d9b17f4f4b41495d0757f Parents: 77df557 Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:11 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:11 2014 -0600 -- CHANGES.txt| 1 + .../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 3c651ff..385af01 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 2.0.12: + * Default DTCS base_time_seconds changed to 60 (CASSANDRA-8417) * Refuse Paxos operation with more than one pending endpoint (CASSANDRA-8346) * Throw correct exception when trying to bind a keyspace or table name (CASSANDRA-6952) http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java -- diff --git a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java index 9fed3e0..ddc8dc7 100644 --- a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java +++ b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java @@ -26,7 +26,7 @@ public final class 
DateTieredCompactionStrategyOptions
 {
 protected static final TimeUnit DEFAULT_TIMESTAMP_RESOLUTION = TimeUnit.MICROSECONDS;
 protected static final long DEFAULT_MAX_SSTABLE_AGE_DAYS = 365;
-protected static final long DEFAULT_BASE_TIME_SECONDS = 60 * 60;
+protected static final long DEFAULT_BASE_TIME_SECONDS = 60;
 protected static final String TIMESTAMP_RESOLUTION_KEY = "timestamp_resolution";
 protected static final String MAX_SSTABLE_AGE_KEY = "max_sstable_age_days";
 protected static final String BASE_TIME_KEY = "base_time_seconds";
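To see why this one-line default change matters, here is a rough illustration assuming the commonly described DTCS window model, where each successive time window spans min_threshold (default 4) times more time than the previous one. The helper is illustrative, not Cassandra code:

```java
import java.util.Arrays;

// Illustrative DTCS window-size helper (assumed model: each tier's window is
// min_threshold times larger than the previous). Shrinking base_time_seconds
// from 3600 to 60 makes the initial windows 60x finer-grained, so freshly
// flushed sstables get grouped and compacted much sooner.
public final class DtcsWindows {
    static long[] windowSizes(long baseSeconds, int minThreshold, int tiers) {
        long[] sizes = new long[tiers];
        long size = baseSeconds;
        for (int i = 0; i < tiers; i++) {
            sizes[i] = size;
            size *= minThreshold;   // next tier covers minThreshold x the time span
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Old default: 3600s base -> 1h, 4h, 16h, 64h windows
        System.out.println(Arrays.toString(windowSizes(3600, 4, 4)));
        // New default: 60s base -> 1m, 4m, 16m, 64m windows
        System.out.println(Arrays.toString(windowSizes(60, 4, 4)));
    }
}
```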
[6/6] cassandra git commit: Merge branch 'cassandra-2.1' into trunk
Merge branch 'cassandra-2.1' into trunk Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/6ce8b3fc Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/6ce8b3fc Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/6ce8b3fc Branch: refs/heads/trunk Commit: 6ce8b3fcbbd5f6638ee635fbc395541afdb5eef8 Parents: c64ac41 27c67ad Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:54 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:54 2014 -0600 -- CHANGES.txt| 1 + .../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/6ce8b3fc/CHANGES.txt --
[2/6] cassandra git commit: Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417
Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/57843095 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/57843095 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/57843095 Branch: refs/heads/cassandra-2.1 Commit: 578430952789bbc2dc7d9b17f4f4b41495d0757f Parents: 77df557 Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:11 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:11 2014 -0600 -- CHANGES.txt| 1 + .../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 3c651ff..385af01 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 2.0.12: + * Default DTCS base_time_seconds changed to 60 (CASSANDRA-8417) * Refuse Paxos operation with more than one pending endpoint (CASSANDRA-8346) * Throw correct exception when trying to bind a keyspace or table name (CASSANDRA-6952) http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java -- diff --git a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java index 9fed3e0..ddc8dc7 100644 --- a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java +++ b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java @@ -26,7 +26,7 @@ public final class DateTieredCompactionStrategyOptions { protected static final TimeUnit DEFAULT_TIMESTAMP_RESOLUTION = TimeUnit.MICROSECONDS; protected static final long DEFAULT_MAX_SSTABLE_AGE_DAYS = 
365;
-protected static final long DEFAULT_BASE_TIME_SECONDS = 60 * 60;
+protected static final long DEFAULT_BASE_TIME_SECONDS = 60;
 protected static final String TIMESTAMP_RESOLUTION_KEY = "timestamp_resolution";
 protected static final String MAX_SSTABLE_AGE_KEY = "max_sstable_age_days";
 protected static final String BASE_TIME_KEY = "base_time_seconds";
[3/6] cassandra git commit: Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417
Default DTCS base_time_seconds changed to 60 patch by Björn Hegerfors; reviewed by jbellis for CASSANDRA-8417 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/57843095 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/57843095 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/57843095 Branch: refs/heads/trunk Commit: 578430952789bbc2dc7d9b17f4f4b41495d0757f Parents: 77df557 Author: Jonathan Ellis jbel...@apache.org Authored: Wed Dec 10 15:19:11 2014 -0600 Committer: Jonathan Ellis jbel...@apache.org Committed: Wed Dec 10 15:19:11 2014 -0600 -- CHANGES.txt| 1 + .../db/compaction/DateTieredCompactionStrategyOptions.java | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 3c651ff..385af01 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 2.0.12: + * Default DTCS base_time_seconds changed to 60 (CASSANDRA-8417) * Refuse Paxos operation with more than one pending endpoint (CASSANDRA-8346) * Throw correct exception when trying to bind a keyspace or table name (CASSANDRA-6952) http://git-wip-us.apache.org/repos/asf/cassandra/blob/57843095/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java -- diff --git a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java index 9fed3e0..ddc8dc7 100644 --- a/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java +++ b/src/java/org/apache/cassandra/db/compaction/DateTieredCompactionStrategyOptions.java @@ -26,7 +26,7 @@ public final class DateTieredCompactionStrategyOptions { protected static final TimeUnit DEFAULT_TIMESTAMP_RESOLUTION = TimeUnit.MICROSECONDS; protected static final long DEFAULT_MAX_SSTABLE_AGE_DAYS = 365; 
-protected static final long DEFAULT_BASE_TIME_SECONDS = 60 * 60;
+protected static final long DEFAULT_BASE_TIME_SECONDS = 60;
 protected static final String TIMESTAMP_RESOLUTION_KEY = "timestamp_resolution";
 protected static final String MAX_SSTABLE_AGE_KEY = "max_sstable_age_days";
 protected static final String BASE_TIME_KEY = "base_time_seconds";
[jira] [Resolved] (CASSANDRA-6060) Remove internal use of Strings for ks/cf names
[ https://issues.apache.org/jira/browse/CASSANDRA-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis resolved CASSANDRA-6060. --- Resolution: Won't Fix Fix Version/s: (was: 3.0) All right, let's wontfix this. Remove internal use of Strings for ks/cf names -- Key: CASSANDRA-6060 URL: https://issues.apache.org/jira/browse/CASSANDRA-6060 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Labels: performance We toss a lot of Strings around internally, including across the network. Once a request has been Prepared, we ought to be able to encode these as int ids. Unfortunately, we moved from int to uuid in CASSANDRA-3794, which was a reasonable move at the time, but a uuid is a lot bigger than an int. Now that we have CAS we can allow concurrent schema updates while still using sequential int IDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8457) nio MessagingService
Jonathan Ellis created CASSANDRA-8457: - Summary: nio MessagingService Key: CASSANDRA-8457 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Fix For: 3.0 Thread-per-peer (actually two per peer, one each for incoming and outbound) is a big contributor to context switching, especially for larger clusters. Let's look at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
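The single-selector model the ticket proposes can be sketched as follows: one thread multiplexes all peer sockets instead of dedicating blocking threads to each peer. All names are illustrative, not the actual MessagingService API:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Illustrative nio event loop: a single selector thread accepts peer
// connections and services reads, avoiding two blocking threads per peer.
public final class NioMessagingSketch {
    public static void serve(int port) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
        while (selector.select() >= 0) {            // one thread handles every peer
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isAcceptable()) {
                    SocketChannel peer = server.accept();
                    peer.configureBlocking(false);
                    peer.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    if (((SocketChannel) key.channel()).read(buf) < 0)
                        key.cancel();               // peer closed the connection
                    // else: hand buf off to message deserialization
                }
            }
            selector.selectedKeys().clear();
        }
    }
}
```

Netty wraps this same selector machinery (plus pooled buffers and pipeline handlers), which is why it is mentioned as a candidate.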
[jira] [Comment Edited] (CASSANDRA-6060) Remove internal use of Strings for ks/cf names
[ https://issues.apache.org/jira/browse/CASSANDRA-6060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241364#comment-14241364 ] Ariel Weisberg edited comment on CASSANDRA-6060 at 12/10/14 9:30 PM: - I am still digging but I am not sure there is much value here. For prepared statements between client and server there are no ks/cf names. Here is the breakdown for a minimum size mutation inside the cluster:
Size of Ethernet frame - 24 bytes
Size of IPv4 header (without any options) - 20 bytes
Size of TCP header (without any options) - 20 bytes
4-byte protocol magic
4-byte version
4-byte timestamp
4-byte verb
4-byte parameter count
4-byte payload length prefix
No keyspace name in current versions
2-byte key length
key, say 10 bytes
4-byte mutation count
1-byte boolean
16-byte cf id
4-byte count of columns
Per column:
2-byte column name length prefix
column name, say 8 bytes
1-byte serialization flags
8-byte timestamp
4-byte length prefix
column value, say 8 bytes
Total is 158 bytes. Saving 12 bytes on the CF uuid would be 7.5%. For single CF mutations this is not a win. Loading data points 16 bytes at a time isn't going to work so hot anyways so people might look into batching at that point. The UUID is not repeated for each cell so it is a one time cost for workloads that modify multiple cells per CF. The one case where the 12 bytes become significant is single cell updates to multiple CFs in one mutation. There the 12-byte overhead converges on 23%. I am going to look at the read path next, but I kind of expect to find something similar. A read is going to have key overhead and possibly overhead for all the other query parameters that should match the simple single cell mutation case. Remove internal use of Strings for ks/cf names -- Key: CASSANDRA-6060 URL: https://issues.apache.org/jira/browse/CASSANDRA-6060 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Labels: performance We toss a lot of Strings around internally, including across the network. Once a request has been Prepared, we ought to be able to encode these as int ids. Unfortunately, we moved from int to uuid in CASSANDRA-3794, which was a reasonable move at the time, but a uuid is a lot bigger than an int. Now that we have CAS we can allow concurrent schema updates while still using sequential int IDs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
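As a sanity check of the byte breakdown in the comment above: summing the itemized fields gives 156 bytes by this count (the quoted total of 158 presumably includes a couple of bytes not itemized there); either way the 12-byte cf-id saving lands in the 7-8% range for a minimal single-column mutation:

```java
// Worked arithmetic over the itemized fields from the comment; field names
// and "say N bytes" sizes are taken directly from that breakdown.
public final class MutationSizeCheck {
    public static void main(String[] args) {
        int frame    = 24 + 20 + 20;            // Ethernet + IPv4 + TCP headers
        int header   = 4 * 6;                   // magic, version, timestamp, verb, param count, payload length
        int key      = 2 + 10;                  // key length prefix + 10-byte key
        int mutation = 4 + 1 + 16 + 4;          // mutation count, boolean, 16-byte cf uuid, column count
        int column   = 2 + 8 + 1 + 8 + 4 + 8;   // name length, name, flags, timestamp, value length, value
        int total    = frame + header + key + mutation + column;
        System.out.println(total);              // 156 by this itemization
        System.out.printf("cf uuid -> int saves %.1f%%%n", 12.0 * 100 / total);
    }
}
```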
[jira] [Updated] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Ellis updated CASSANDRA-8457: -- Labels: performance (was: ) nio MessagingService Key: CASSANDRA-8457 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Jonathan Ellis Assignee: Ariel Weisberg Labels: performance Fix For: 3.0 Thread-per-peer (actually two each incoming and outbound) is a big contributor to context switching, especially for larger clusters. Let's look at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8418) Queries that require allow filtering are working without it
[ https://issues.apache.org/jira/browse/CASSANDRA-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241781#comment-14241781 ] Benjamin Lerer commented on CASSANDRA-8418: --- All my apologies. I was wrong; the DTest was right. The queries do not need {{ALLOW FILTERING}} because, as {{time1}} is a clustering column, the secondary index code can use a {{SliceQueryFilter}} instead of doing some post filtering on the results returned by the index. Queries that require allow filtering are working without it --- Key: CASSANDRA-8418 URL: https://issues.apache.org/jira/browse/CASSANDRA-8418 Project: Cassandra Issue Type: Bug Reporter: Philip Thompson Assignee: Benjamin Lerer Priority: Minor Fix For: 2.0.12, 2.1.3 The trunk dtest {{cql_tests.py:TestCQL.composite_index_with_pk_test}} has begun failing after the changes to CASSANDRA-7981. With the schema {code}CREATE TABLE blogs ( blog_id int, time1 int, time2 int, author text, content text, PRIMARY KEY (blog_id, time1, time2)){code} and {code}CREATE INDEX ON blogs(author){code}, then the query {code}SELECT blog_id, content FROM blogs WHERE time1 > 0 AND author='foo'{code} now requires ALLOW FILTERING, but did not before the refactor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241785#comment-14241785 ] Jonathan Ellis commented on CASSANDRA-8447: --- Is that still reporting serialized/live size as the same on flush? Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled --- Key: CASSANDRA-8447 URL: https://issues.apache.org/jira/browse/CASSANDRA-8447 Project: Cassandra Issue Type: Bug Components: Core Environment: Cluster size - 4 nodes Node size - 12 CPU (hyper threaded to 24 cores), 192 GB RAM, 2 Raid 0 arrays (Data - 10 disk, spinning 10k drives | CL 2 disk, spinning 10k drives) OS - RHEL 6.5 jvm - oracle 1.7.0_71 Cassandra version 2.0.11 Reporter: jonathan lacefield Attachments: Node_with_compaction.png, Node_without_compaction.png, cassandra.yaml, gc.logs.tar.gz, gcinspector_messages.txt, memtable_debug, results.tar.gz, visualvm_screenshot Behavior - If autocompaction is enabled, nodes will become unresponsive due to a full Old Gen heap which is not cleared during CMS GC. Test methodology - disabled autocompaction on 3 nodes, left autocompaction enabled on 1 node. Executed different Cassandra stress loads, using write only operations. Monitored visualvm and jconsole for heap pressure. Captured iostat and dstat for most tests. Captured heap dump from 50 thread load. Hints were disabled for testing on all nodes to alleviate GC noise due to hints backing up. Data load test through Cassandra stress - /usr/bin/cassandra-stress write n=19 -rate threads=different threads tested -schema replication\(factor=3\) keyspace=Keyspace1 -node all nodes listed Data load thread count and results: * 1 thread - Still running but looks like the node can sustain this load (approx 500 writes per second per node) * 5 threads - Nodes become unresponsive due to full Old Gen Heap. 
CMS measured in the 60 second range (approx 2k writes per second per node)
* 10 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range
* 50 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 10k writes per second per node)
* 100 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 20k writes per second per node)
* 200 threads - Nodes become unresponsive due to full Old Gen Heap. CMS measured in the 60 second range (approx 25k writes per second per node)
Note - the observed behavior was the same for all tests except for the single threaded test. The single threaded test does not appear to show this behavior. Tested different GC and Linux OS settings with a focus on the 50 and 200 thread loads. JVM settings tested:
# default, out of the box, env-sh settings
# 10 G Max | 1 G New - default env-sh settings
# 10 G Max | 1 G New - default env-sh settings
#* JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"
# 20 G Max | 10 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32768"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
# 20 G Max | 1 G New
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=6"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=3"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=12"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
JVM_OPTS=$JVM_OPTS
[jira] [Updated] (CASSANDRA-8418) Queries that require allow filtering are working without it
[ https://issues.apache.org/jira/browse/CASSANDRA-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Lerer updated CASSANDRA-8418: -- Attachment: CASSANDRA-8418.txt The patch fixes the trunk code and the unit tests. Queries that require allow filtering are working without it --- Key: CASSANDRA-8418 URL: https://issues.apache.org/jira/browse/CASSANDRA-8418 Project: Cassandra Issue Type: Bug Reporter: Philip Thompson Assignee: Benjamin Lerer Priority: Minor Fix For: 2.0.12, 2.1.3 Attachments: CASSANDRA-8418.txt The trunk dtest {{cql_tests.py:TestCQL.composite_index_with_pk_test}} has begun failing after the changes to CASSANDRA-7981. With the schema {code}CREATE TABLE blogs ( blog_id int, time1 int, time2 int, author text, content text, PRIMARY KEY (blog_id, time1, time2)){code} and {code}CREATE INDEX ON blogs(author){code}, then the query {code}SELECT blog_id, content FROM blogs WHERE time1 > 0 AND author='foo'{code} now requires ALLOW FILTERING, but did not before the refactor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-4139) Add varint encoding to Messaging service
[ https://issues.apache.org/jira/browse/CASSANDRA-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241790#comment-14241790 ] Jonathan Ellis commented on CASSANDRA-4139: --- bq. Is bandwidth a constraint for WAN replication? In practice is the default for messaging to have compression on? Often, yes. internode_compression has defaulted to all for a while now. Most people probably leave it at that; the rest change it to dc. Add varint encoding to Messaging service Key: CASSANDRA-4139 URL: https://issues.apache.org/jira/browse/CASSANDRA-4139 Project: Cassandra Issue Type: Sub-task Components: Core Reporter: Vijay Assignee: Ariel Weisberg Fix For: 3.0 Attachments: 0001-CASSANDRA-4139-v1.patch, 0001-CASSANDRA-4139-v2.patch, 0001-CASSANDRA-4139-v4.patch, 0002-add-bytes-written-metric.patch, 4139-Test.rtf, ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-4139-v3.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
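For reference, a minimal varint codec in the LEB128 style, as a sketch of the kind of encoding under discussion here; the actual patch may differ (e.g. zigzag encoding for signed values, or the vint formats Cassandra uses elsewhere).

```python
# Unsigned base-128 varint: 7 payload bits per byte, high bit set on all
# bytes except the last. Small values (the common case for lengths and
# counts in messaging) shrink to one or two bytes.
def encode_varint(n: int) -> bytes:
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    n = shift = 0
    for byte in buf:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return n

assert encode_varint(300) == b"\xac\x02"
assert decode_varint(encode_varint(300)) == 300
```

The bandwidth question above is about exactly this trade-off: varints save bytes on the wire, but internode_compression already shrinks small integers well, so the win depends on whether compression is enabled on a given link.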
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241902#comment-14241902 ] jonathan lacefield commented on CASSANDRA-8447: --- they are still very close, yes.
INFO [OptionalTasks:1] 2014-12-10 11:11:56,736 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@467482876(21395220/213952200 serialized/live bytes, 486255 ops)
INFO [OptionalTasks:1] 2014-12-10 11:11:58,746 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@550824252(20002840/200028400 serialized/live bytes, 454610 ops)
INFO [OptionalTasks:1] 2014-12-10 11:12:00,765 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@1776946438(19270460/192704600 serialized/live bytes, 437965 ops)
INFO [OptionalTasks:1] 2014-12-10 11:12:02,777 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@2007866469(20061800/200618000 serialized/live bytes, 455950 ops)
INFO [OptionalTasks:1] 2014-12-10 11:12:04,946 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@458183382(19050680/190506800 serialized/live bytes, 432970 ops)
INFO [OptionalTasks:1] 2014-12-10 11:12:06,961 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@2027660149(23800920/238009200 serialized/live bytes, 540930 ops)
INFO [OptionalTasks:1] 2014-12-10 11:12:09,237 ColumnFamilyStore.java (line 794) Enqueuing flush of Memtable-Standard1@841856891(21873060/218730600 serialized/live bytes, 497115 ops)
[jira] [Commented] (CASSANDRA-8390) The process cannot access the file because it is being used by another process
[ https://issues.apache.org/jira/browse/CASSANDRA-8390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14241928#comment-14241928 ] Alexander Radzin commented on CASSANDRA-8390: - I have Cassandra 2.1.2 on Windows 8.1. Typically it takes 1.5 iterations of the main loop to reproduce the problem with cqlSync. As a Windows 8 user I have Windows Defender, and it is turned on. I have just run the test again and it passed. Then I changed the year from 2013 to 2014 and ran the test again. It failed when it arrived at month 11. {noformat}
CREATE TABLE measure_201411.index_bcon_page_load_aggregation (partition ascii, attr ascii, value varchar, time timeuuid, bloom blob, PRIMARY KEY (partition, attr, value, time)) WITH compaction = { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 40, 'max_threshold' : 45 } AND gc_grace_seconds = 0 AND memtable_flush_period_in_ms = 30;
CREATE TABLE measure_201411.bcon_page_event_aggregation (partition ascii, time timeuuid, data blob, PRIMARY KEY (partition, time)) WITH compaction = { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 40, 'max_threshold' : 45 } AND gc_grace_seconds = 0 AND memtable_flush_period_in_ms = 30;
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.TransportException: [localhost/127.0.0.1:9042] Error writing: Closed channel))
	at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
	at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:259)
	at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:175)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
	at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:36)
	at com.clarisite.clingine.dataaccesslayer.cassandra.CQLTest.cqlSync(CQLTest.java:32)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at com.intellij.junit4.JUnit4TestRunnerUtil$IgnoreIgnoredTestJUnit4ClassRunner.runChild(JUnit4TestRunnerUtil.java:269)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:202)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:65)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at com.intellij.rt.execution.CommandLineWrapper.main(CommandLineWrapper.java:121)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.TransportException: [localhost/127.0.0.1:9042] Error writing: Closed channel))
	at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:102)
	at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:176)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242160#comment-14242160 ] Jonathan Ellis commented on CASSANDRA-8447: --- ... realized that the live bytes have an extra zero. So it's actually 10.0 liveRatio, which is what it defaults to when it hasn't been computed on a fresh memtable.
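The arithmetic behind the "extra zero" observation can be checked directly: dividing the live bytes by the serialized bytes in each of the flush log lines quoted earlier gives exactly 10.0, the default liveRatio used before one has been computed for a memtable. A small sketch (the regex and helper name are illustrative, not Cassandra code):

```python
# Parse a "serialized/live bytes" pair out of a ColumnFamilyStore flush log
# line and compute the live/serialized ratio.
import re

def live_ratio(log_line: str) -> float:
    m = re.search(r"\((\d+)/(\d+) serialized/live bytes", log_line)
    serialized, live = int(m.group(1)), int(m.group(2))
    return live / serialized

line = ("Enqueuing flush of Memtable-Standard1@467482876"
        "(21395220/213952200 serialized/live bytes, 486255 ops)")
assert live_ratio(line) == 10.0
```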
[jira] [Commented] (CASSANDRA-8447) Nodes stuck in CMS GC cycle with very little traffic when compaction is enabled
[ https://issues.apache.org/jira/browse/CASSANDRA-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14242183#comment-14242183 ] Philo Yang commented on CASSANDRA-8447: --- I have the same problem: full GC cannot reduce the size of the old gen. Days ago I posted this problem to the mailing list; people thought it would be solved by tuning the GC settings, but that didn't work for me. After seeing this issue I think it may be a bug. I can offer some information from my cluster in the hope that it helps find the bug, if it is one. Of course, the cause of my problem may be different from [~jlacefie]'s. The unresponsiveness only appears on some nodes; in other words, some nodes become unresponsive several times a day while the others never do. When there is no trouble the load on all nodes is the same, so I don't think the unresponsiveness is caused by heavy load. Before a node becomes unresponsive, jstat easily shows that the old gen is still very large after a CMS GC (usually it is under 1 GB after a full GC, but when the trouble comes it stays above 4 GB even after CMS GC). And there may be a compaction stuck for many minutes at 99% or even 99.99%, like this:
pending tasks: 1
compaction type   keyspace   table   completed   total       unit    progress
Compaction        keyspace   table   354680703   354710642   bytes   99.99%
But I'm not sure the trouble always comes with a stuck compaction, because I haven't followed every unresponsive episode.
I used jmap to print the objects held in the heap; I don't know if it is helpful to you:
num     #instances  #bytes      class name
--
1:      11899268    3402792016  [B
2:      23734819    1139271312  java.nio.HeapByteBuffer
3:      11140273    306165600   [Ljava.nio.ByteBuffer;
4:      9484838     227636112   org.apache.cassandra.db.composites.CompoundComposite
5:      8220604     197294496   org.apache.cassandra.db.composites.BoundedComposite
6:      27187       69131928    [J
7:      1673344     53547008    org.apache.cassandra.db.composites.CompoundSparseCellName
8:      1540101     49283232    org.apache.cassandra.db.BufferCell
9:      3158        45471360    [Lorg.apache.cassandra.db.composites.Composite;
10:     2527        27865040    [I
11:     251797      20236456    [Ljava.lang.Object;
12:     417752      12899976    [C
13:     263201      10528040    org.apache.cassandra.db.BufferExpiringCell
14:     322324      10314368    com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$Node
15:     322237      10311584    com.googlecode.concurrentlinkedhashmap.ConcurrentHashMapV8$Node
16:     417331      10015944    java.lang.String
17:     86368       8891280     [Lorg.apache.cassandra.db.Cell;
18:     349917      8398008     org.apache.cassandra.cql3.ColumnIdentifier
19:     204161      8166440     java.util.TreeMap$Entry
20:     322324      7735776     com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue
21:     322324      7735776     org.apache.cassandra.cache.KeyCacheKey
22:     317274      7614576     java.lang.Double
23:     317154      7611696     org.apache.cassandra.db.RowIndexEntry
24:     314642      7551408     java.util.concurrent.ConcurrentSkipListMap$Node
25:     52560       7316584     constMethodKlass
26:     292136      7011264     java.lang.Long
27:     52560       6740064     methodKlass
28:     5290        5937512     constantPoolKlass
29:     160281      3846744     org.apache.cassandra.db.BufferDecoratedKey
30:     155777      3738648     java.util.concurrent.ConcurrentSkipListMap$Index
31:     5290        3642232     instanceKlassKlass
32:     150284      3606816     org.apache.cassandra.db.AtomicBTreeColumns
33:     150261      3606264     org.apache.cassandra.db.AtomicBTreeColumns$Holder
34:     87861       3514440     org.apache.cassandra.db.ArrayBackedSortedColumns
35:     87768       3510720     org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask
36:     4634        3372096     constantPoolCacheKlass
37:     84262       3370480     java.util.Collections$SingletonMap
38:     6243        2778728     methodDataKlass
39:     173490      2775840     org.apache.cassandra.dht.LongToken
40:     82000       2624000     java.util.RegularEnumSet
41:     81981       2623392     org.apache.cassandra.net.MessageIn
42:     81980       2623360     org.apache.cassandra.net.MessageDeliveryTask
43:     102511      2460264     java.util.concurrent.ConcurrentLinkedQueue$Node
44:     94901       2277624     org.apache.cassandra.db.DeletionInfo
45:     93837       2252088     java.util.concurrent.Executors$RunnableAdapter
46:     140525