Re: Datastax EC2 Ami
Hi,

I would remove the node and start a new one. You can pick a specific Cassandra release using user data (e.g. --release 2.0.11).

Cheers, Tommaso

On Mon, Apr 27, 2015 at 8:53 PM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote:
Hi Guys, we started our Cassandra cluster with the following AMI: ami-ada2b6c4 https://console.aws.amazon.com/ec2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-ada2b6c4
Now we need to add a new node and we realize this AMI has Cassandra 2.1.4 instead of 2.1.0-2. Is it safe to join this node to the cluster? Or do we need to downgrade on the new node?
Regards, Eduardo
RE: Data model suggestions
I recommend truncating the table instead of dropping it, since you don't need to re-issue DDL commands and put load on the system keyspace. Both DROP and TRUNCATE automatically create snapshots; there is no "snapshotting" advantage to using DROP. See http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot

From: Ali Akhtar [mailto:ali.rac...@gmail.com] Sent: Sunday, April 26, 2015 10:31 PM To: user@cassandra.apache.org Subject: Re: Data model suggestions

Thanks Peer. I like the approach you're suggesting. Why do you recommend truncating the last active table rather than just dropping it? Since all the data would be inserted into a new table, it seems like it would make sense to drop the last table, and that way truncate snapshotting also won't have to be dealt with (unless I'm missing anything). Thanks.

On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded oded.p...@rsa.com wrote:
I would maintain two tables. An "archive" table that holds all the active and inactive records and is updated hourly (re-inserting the same record has some compaction overhead, but on the other hand deleting records has tombstone overhead). An "active" table which holds all the records from the last external API invocation. To avoid tombstone and read-before-delete issues, "active" should actually be a synonym, an alias, for the most recent active table. I suggest you create two identical tables, "active1" and "active2", and an "active_alias" table that indicates which of the two is the most recent. Thus when you query the external API you insert the data into "archive" and into the unaliased "activeN" table, switch the alias value in "active_alias", and truncate the newly unaliased "activeM" table. No need to query the data before inserting it. Make sure truncating doesn't create automatic snapshots.

From: Narendra Sharma [mailto:narendra.sha...@gmail.com] Sent: Friday, April 24, 2015 6:53 AM To: user@cassandra.apache.org Subject: Re: Data model suggestions

I think one table, say "record", should be good. The primary key is the record id. This will ensure good distribution. Just update the active attribute to true or false. For range queries on active vs. archive records, maintain 2 indexes or try a secondary index.

On Apr 23, 2015 1:32 PM, Ali Akhtar ali.rac...@gmail.com wrote:
Good point about the range selects. I think they can be made to work with limits, though. Or, since the active records will usually never exceed 500k, the ids may just be cached in memory. Most of the time, during reads, the queries will just consist of select * where primaryKey = someValue. One row at a time. The question is just whether to keep all records in one table (including archived records which won't be queried 99% of the time), or to keep active records in their own table and delete them when they're no longer active. Will that produce tombstone issues?

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar khangaon...@gmail.com wrote:
Hi, If your external API returns active records, that means, I am guessing, you need to do a select * on the active table to figure out which records in the table are no longer active. You might be aware that range selects based on the partition key will time out in Cassandra. They can, however, be made to work using the clustering column key. To comment more, we would need to see your proposed Cassandra tables and the queries that you might need to run. regards

On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar ali.rac...@gmail.com wrote:
That's returned by the external API we're querying. We query them for active records; if a previously active record isn't included in the results, that means it's time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar khangaon...@gmail.com wrote:
Hi, How do you determine if a record is no longer active? Is it a periodic process that goes through every record and checks when the last update happened? regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar ali.rac...@gmail.com wrote:
Hey all, We are working on moving a MySQL-based application to Cassandra. The workflow in MySQL is this: we have two tables, active and archive. Every hour, we pull in data from an external API. The records which are active are kept in the 'active' table. Once a record is no longer active, it's deleted from 'active' and re-inserted into 'archive'. The purpose of that is that most of the time queries are only done against the active records rather than the archived ones, so keeping the active table small may help with faster queries, if it only has to search 200k records vs. 3 million or more. Is it advisable to keep the same data model in Cassandra? I'm concerned about
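Concretely, the alias approach Oded describes could look something like the following in CQL. This is a sketch only, not from the thread: the table schemas and record names are illustrative, and the pattern assumes auto_snapshot has been disabled in cassandra.yaml as he notes.

```sql
-- Illustrative schemas (columns are placeholders for the real record layout).
CREATE TABLE archive (id text PRIMARY KEY, payload text);
CREATE TABLE active1 (id text PRIMARY KEY, payload text);
CREATE TABLE active2 (id text PRIMARY KEY, payload text);

-- Single-row table recording which activeN table is currently aliased.
CREATE TABLE active_alias (key int PRIMARY KEY, current text);
INSERT INTO active_alias (key, current) VALUES (1, 'active1');

-- Hourly refresh, assuming 'active2' is the currently unaliased table:
-- 1. Insert every record returned by the external API, no read-before-write:
INSERT INTO archive (id, payload) VALUES ('r1', '...');
INSERT INTO active2 (id, payload) VALUES ('r1', '...');
-- 2. Switch the alias so readers start using active2:
UPDATE active_alias SET current = 'active2' WHERE key = 1;
-- 3. Truncate the now-unaliased table, ready for the next cycle:
TRUNCATE active1;
```

Readers first select `current` from `active_alias`, then query that table; TRUNCATE avoids the tombstones that per-row deletes would leave behind.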
Re: Data model suggestions
Wouldn't truncating the table create tombstones?

On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded oded.p...@rsa.com wrote:
[earlier messages in the thread, quoted in full above, trimmed]
is Thrift support, from Cassandra, really mandatory for OpsCenter monitoring ?
Hi,

While reading the OpsCenter 5.1 docs, it looks like OpsCenter can't work if Cassandra does not provide a Thrift interface (see [1] below). Is that really the case? At first sight it sounded weird to me, as CQL 3 has been available for months. Just to know: is a future OpsCenter version, not relying on a mandatory Thrift interface, on the roadmap?

Thanks. Regards, Dominique

[1] in the OpsCenter 5.1 guide:
***
Modifying how OpsCenter connects to clusters
Cluster Connection settings define how OpsCenter connects to a cluster.
About this task
The Connection settings for a cluster define how OpsCenter connects to the cluster. For example, if you've enabled authentication or encryption on a cluster, you'll need to specify that information.
Procedure
1. Select the cluster you want to edit from the Cluster menu.
2. Click Settings > Cluster Connections. The Edit Cluster dialog appears.
3. Change the IP addresses of cluster nodes.
4. Change the JMX and Thrift listen port numbers.
5. Edit the user credentials if the JMX or Thrift ports require authentication.
Re: Data model suggestions
No - it immediately removes the sstables on all nodes.

On Mon, Apr 27, 2015 at 7:53 AM, Ali Akhtar ali.rac...@gmail.com wrote:
Wouldn't truncating the table create tombstones?
[earlier messages in the thread, quoted in full above, trimmed]
Re: is Thrift support, from Cassandra, really mandatory for OpsCenter monitoring ?
On 04/27/2015 08:18 AM, DE VITO Dominique wrote:
> Just to know, is a future OpsCenter version, not relying on a mandatory Thrift interface, on the roadmap?

Yes.

--
Kind regards, Michael
Re: Best Practice to add a node in a Cluster
Thanks Eric and Matt :) !!

Yes, the purpose is to improve reliability. Right now, from our driver, we are querying using degradePolicy for reliability.

For changing the keyspace to RF=3, the procedure is as under:
1. Add a new node to the cluster (the new node is not in the seed list).
2. ALTER KEYSPACE system_auth WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
3. On each affected node, run nodetool repair http://docs.datastax.com/en/cassandra/1.2/cassandra/tools/toolsNodetool_r.html.
4. Wait until repair completes on a node, then move to the next node.

Any other things to take care of?

Thanks, Regards, neha

On Mon, Apr 27, 2015 at 9:45 PM, Eric Stevens migh...@gmail.com wrote:
It depends on why you're adding a new node. If you're running out of disk space or IO capacity in your 2 node cluster, then changing RF to 3 will not improve either condition - you'd still be writing all data to all three nodes. However, if you're looking to improve reliability, a 2 node RF=2 cluster cannot have either node offline without losing quorum, while a 3 node RF=3 cluster can have one node offline and still be able to achieve quorum. RF=3 is a common replication factor because of this characteristic. Make sure your new node is not in its own seeds list, or it will not bootstrap (it will come online immediately and start serving requests).

On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com wrote:
Hi, We have a 2 node cluster with RF=2. We are planning to add a new node. Should we change RF to 3 in the schema? Or just add a new node with the same RF=2? Any other best practices that we need to take care of? Thanks, regards, Neha
Re: Best Practice to add a node in a Cluster
Hi Neha,

After you add the node to the cluster, run nodetool cleanup on all nodes. Next, running repair on each node will replicate the data. Make sure you run the repair on one node at a time, because repair is an expensive process (it utilizes high CPU).

On Mon, Apr 27, 2015 at 8:36 PM, Neha Trivedi nehajtriv...@gmail.com wrote:
[earlier messages in the thread, quoted in full above, trimmed]

--
Arun
Senior Hadoop/Cassandra Engineer, Cloudwick
Champion of Big Data (Cloudera) http://www.cloudera.com/content/dev-center/en/home/champions-of-big-data.html
2014 Data Impact Award Winner (Cloudera) http://www.cloudera.com/content/cloudera/en/campaign/data-impact-awards.html
minimum bandwidth requirement between two Geo Redundant sites of Cassandra database
Hi,

Is there any minimum bandwidth requirement for the link between two geo-redundant data centres? And what latency should the link between two geo-redundant data centres have for the most efficient operation?

Regards, Gaurav
New node got stuck joining the cluster after a while
Hello guys,

I have a cluster comprised of 2 nodes, configured with vnodes, using the 2.1.0-2 version of Cassandra. I am facing an issue when I want to join a new node to the cluster. At first it started joining, but then it got stuck:

UN 1x.x.x.x 348.11 GB 256 100.0% 1c
UN 1x.x.x.x 342.74 GB 256 100.0% 1c
UJ 1x.x.x.x 26.86 GB 256 ? 1c

I can see some errors on the already-working nodes:

WARN [SharedPool-Worker-7] 2015-04-27 17:41:16,060 SliceQueryFilter.java:236 - Read 5001 live and 66548 tombstoned cells in usmc.userpixel (see tombstone_warn_threshold). 5000 columns was requested, slices=[-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
WARN [SharedPool-Worker-32] 2015-04-27 17:41:16,668 SliceQueryFilter.java:236 - Read 2012 live and 30440 tombstoned cells in usmc.userpixel (see tombstone_warn_threshold). 5001 columns was requested, slices=[b6d051df-0a8f-4c13-b93c-1b4ff0d82b8d:date-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
ERROR [CompactionExecutor:35638] 2015-04-27 19:06:07,613 CassandraDaemon.java:166 - Exception in thread Thread[CompactionExecutor:35638,1,main]
java.lang.AssertionError: Memory was freed
    at org.apache.cassandra.io.util.Memory.checkPosition(Memory.java:281) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.util.Memory.getInt(Memory.java:233) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.sstable.IndexSummary.getPositionInSummary(IndexSummary.java:118) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.sstable.IndexSummary.getKey(IndexSummary.java:123) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.sstable.IndexSummary.binarySearch(IndexSummary.java:92) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.sstable.SSTableReader.getSampleIndexesForRanges(SSTableReader.java:1209) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.io.sstable.SSTableReader.estimatedKeysForRanges(SSTableReader.java:1165) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:328) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.findDroppableSSTable(LeveledCompactionStrategy.java:365) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:127) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:112) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:229) ~[apache-cassandra-2.1.0.jar:2.1.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
    at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
    at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]

But I do not see any warning or error message in the logs of the joining node.
I just see an exception on the joining node when I run nodetool info:

root@:~# nodetool info
ID : f5e49647-59fa-474f-b6af-9f65abc43581
Gossip active : true
Thrift active : false
Native Transport active : false
Load : 26.86 GB
Generation No : 1430163258
Uptime (seconds) : 18799
Heap Memory (MB) : 4185.15 / 7566.00
error: null
-- StackTrace --
java.lang.AssertionError
    at org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:440)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2079)
    at org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2068)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
    at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at
Re: Best Practice to add a node in a Cluster
Thanks Arun!

On Tue, Apr 28, 2015 at 9:44 AM, arun sirimalla arunsi...@gmail.com wrote:
[earlier messages in the thread, quoted in full above, trimmed]
Best Practice to add a node in a Cluster
Hi,

We have a 2 node cluster with RF=2. We are planning to add a new node. Should we change RF to 3 in the schema, or just add a new node with the same RF=2? Are there any other best practices we need to take care of?

Thanks, regards, Neha
RE: Best Practice to add a node in a Cluster
Hi Neha,

I guess it depends why you are adding a new node – do you need more storage capacity, do you want better resilience, or are you trying to increase performance?

If you add a new node with the same amount of storage as the previous two, but you increase the RF, you will use up all of the storage you have added by replicating the existing data onto the new node. If you keep it at RF=2, then once you have done all the bootstrapping and cleanup, your usage on the existing two nodes should decrease by about 30% (of their total size). However, if it is resilience you are after (being able to take down nodes without losing availability), then increasing the RF will give you this, at the expense of using more storage.

Hope that helps. Cheers, Matt

From: Neha Trivedi [mailto:nehajtriv...@gmail.com] Sent: 27 April 2015 16:46 To: user@cassandra.apache.org Subject: Best Practice to add a node in a Cluster
[original question quoted above, trimmed]
Re: Best Practice to add a node in a Cluster
It depends on why you're adding a new node. If you're running out of disk space or IO capacity in your 2 node cluster, then changing RF to 3 will not improve either condition - you'd still be writing all data to all three nodes.

However, if you're looking to improve reliability, a 2 node RF=2 cluster cannot have either node offline without losing quorum, while a 3 node RF=3 cluster can have one node offline and still be able to achieve quorum. RF=3 is a common replication factor because of this characteristic.

Make sure your new node is not in its own seeds list, or it will not bootstrap (it will come online immediately and start serving requests).

On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com wrote:
Hi, We have a 2 node cluster with RF=2. We are planning to add a new node. Should we change RF to 3 in the schema? Or just add a new node with the same RF=2? Any other best practices that we need to take care of? Thanks, regards, Neha
Fwd: Data Modelling Help
Hi,

I am a newbie with Cassandra and thus need data modelling help, as I haven't found a resource that tackles the same problem. The use case is similar to an email system: I want to store a timeline of all emails a user has received and then fetch them back in three different ways:
1. All emails ever received
2. Mails that have been read by the user
3. Mails that are still unread by the user

My current model is as under:

CREATE TABLE TIMELINE (
  userID varchar,
  emailID varchar,
  timestamp bigint,
  read boolean,
  PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);
CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = 12;
SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = true;
SELECT * FROM TIMELINE where userID = 12 and read = false;
SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = false order by timestamp asc;

Queries are:
1. Should I keep 'read' as my secondary index, given that it will be frequently updated and can create tombstones? Per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html that's a problem.
2. Can we do an inequality check on a secondary index? I found out that at least one equality condition should be present on the secondary index.
3. If this is not the right way to model this, please suggest how to support the above queries. Maintaining three different tables worries me because of the number of insertions (for read/unread), as the number of users * emails viewed per day will be huge.

Thanks in advance. Best Regards! Keep Walking, ~ Sandeep
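For comparison, the three-table alternative Sandeep mentions could be sketched like this. This is not from the thread and is only one possible layout, under the assumption that every query filters by userID; table names and the literal values are illustrative. It trades the frequently-updated secondary index for an extra insert/delete per state change.

```sql
-- All mails ever received, with the read flag as a plain column.
CREATE TABLE timeline_all (
  userID varchar,
  timestamp bigint,
  emailID varchar,
  read boolean,
  PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Unread mails only; a row is deleted once the mail is read.
CREATE TABLE timeline_unread (
  userID varchar,
  timestamp bigint,
  emailID varchar,
  PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Marking a mail as read touches both tables. The DELETE does create
-- a tombstone, but the count is bounded by mails actually read.
UPDATE timeline_all SET read = true WHERE userID = '12' AND timestamp = 1430163258;
DELETE FROM timeline_unread WHERE userID = '12' AND timestamp = 1430163258;

-- The access patterns then become single-partition reads:
SELECT * FROM timeline_all    WHERE userID = '12';                          -- all mails
SELECT * FROM timeline_unread WHERE userID = '12';                          -- unread
SELECT * FROM timeline_all    WHERE userID = '12' ORDER BY timestamp ASC;   -- oldest first
```

A "read mails" query would either need a third timeline_read table populated on read, or client-side filtering of timeline_all on the read column.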
Datastax EC2 Ami
Hi Guys,

We started our Cassandra cluster with the following AMI: ami-ada2b6c4 https://console.aws.amazon.com/ec2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-ada2b6c4

Now we need to add a new node and we realize this AMI has Cassandra 2.1.4 instead of 2.1.0-2. Is it safe to join this node to the cluster? Or do we need to downgrade on the new node?

Regards, Eduardo
Re: Never dropped droppable tombstoned data with LCS
On Sun, Apr 26, 2015 at 1:50 PM, Safa Topal safa.to...@gmail.com wrote:

> We have a 3 node cluster with Cassandra version 2.0.8. I am seeing data that should have been dropped already. In JMX, I can see that DroppableTombstoneRatio is 0.68 for the column family, and tombstone_threshold was left at the default for the CF. We are using LCS on the related CF, and the replication factor of the keyspace is set to 3.

2.0.8 contains significant bugs; I would upgrade to the HEAD of 2.0.x ASAP. Regarding non-dropped data: https://issues.apache.org/jira/browse/CASSANDRA-6654 ?

> We have experienced some downtime because of repair and for that reason we are reluctant to run repair again.

Consider increasing your gc_grace_seconds to 34 days and running repair once a month, on the first of the month, until you resolve the issue. Not running repair on a regular schedule will be fatal to the consistency of some data.

> How can we get rid of this tombstoned data without running repair? Or is there any other way to run repair without exhausting the cluster? I have seen a lot about repair -pr; however, I am not sure if it will be suitable for our case.

Repair -pr should always be used when you are repairing your entire cluster; that's what it's for. How is repair related to the non-purged data? Repair that kills you with tombstones will probably also kill you without tombstones?

=Rob
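The gc_grace_seconds change Rob suggests is applied per table. As a sketch (the keyspace and table names here are placeholders, not from this thread): 34 days is 34 × 86,400 = 2,937,600 seconds, so the monthly repair always completes before tombstones become eligible for purging.

```sql
-- Raise gc_grace_seconds to 34 days (2,937,600 s) so a repair run on
-- the first of each month always finishes inside the grace window.
-- 'my_keyspace.my_table' is a placeholder for the affected CF.
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 2937600;
```

The setting can be lowered again once regular repairs are back on schedule.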