Re: Datastax EC2 Ami

2015-04-27 Thread tommaso barbugli
Hi,

I would remove the node and start a new one. You can pick a specific
Cassandra release using user data (e.g. --release 2.0.11).

Cheers,
Tommaso

On Mon, Apr 27, 2015 at 8:53 PM, Eduardo Cusa 
eduardo.c...@usmediaconsulting.com wrote:

 Hi Guys, we started our Cassandra cluster with the following AMI:


 ami-ada2b6c4


 https://console.aws.amazon.com/ec2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-ada2b6c4


 Now we need to add a new node and we realized this AMI has Cassandra 2.1.4
 instead of 2.1.0-2.

 Is it safe to join this node to the cluster? Or do we need to downgrade on
 the new node?

 Regards
 Eduardo




RE: Data model suggestions

2015-04-27 Thread Peer, Oded
I recommend truncating the table instead of dropping it since you don’t need to 
re-issue DDL commands or put load on the system keyspace.
Both DROP and TRUNCATE automatically create snapshots, so there is no “snapshotting” 
advantage to using DROP. See 
http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot
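
For reference, a minimal sketch of the two options being compared here (the table
name is illustrative):

TRUNCATE active1;    -- removes the data, keeps the schema; snapshots first when auto_snapshot is on
DROP TABLE active1;  -- also snapshots first, but removes the schema, so the DDL must be re-issued to recreate it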


From: Ali Akhtar [mailto:ali.rac...@gmail.com]
Sent: Sunday, April 26, 2015 10:31 PM
To: user@cassandra.apache.org
Subject: Re: Data model suggestions

Thanks Peer. I like the approach you're suggesting.

Why do you recommend truncating the last active table rather than just dropping 
it? Since all the data would be inserted into a new table, it seems like it would 
make sense to drop the last table, and that way TRUNCATE's snapshotting also 
won't have to be dealt with (unless I'm missing anything).

Thanks.


On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded 
oded.p...@rsa.commailto:oded.p...@rsa.com wrote:
I would maintain two tables.
An “archive” table that holds all the active and inactive records, and is 
updated hourly (re-inserting the same record has some compaction overhead, but 
on the other hand deleting records has tombstone overhead).
An “active” table that holds all the records from the last external API 
invocation.
To avoid tombstones and read-before-delete issues, “active” should actually be a 
synonym, an alias, for the most recent active table.
I suggest you create two identical tables, “active1” and “active2”, and an 
“active_alias” table that informs which of the two is the most recent.
Thus when you query the external API you insert the data into “archive” and into 
the unaliased “activeN” table, switch the alias value in “active_alias”, and 
truncate the newly unaliased “activeM” table.
No need to query the data before inserting it. Make sure truncating doesn’t 
create automatic snapshots.
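
A minimal CQL sketch of this alias pattern, assuming a simple record schema (all
table and column names here are illustrative):

CREATE TABLE active1 (record_id text PRIMARY KEY, payload text);
CREATE TABLE active2 (record_id text PRIMARY KEY, payload text);

-- single-row table naming which activeN table is currently the "active" alias
CREATE TABLE active_alias (id int PRIMARY KEY, current_table text);
INSERT INTO active_alias (id, current_table) VALUES (1, 'active1');

-- hourly cycle: load the API results into the unaliased table (say active2),
-- then flip the alias and empty the table that just became unaliased
UPDATE active_alias SET current_table = 'active2' WHERE id = 1;
TRUNCATE active1;

The application would read active_alias first and then query whichever table it
names.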


From: Narendra Sharma 
[mailto:narendra.sha...@gmail.commailto:narendra.sha...@gmail.com]
Sent: Friday, April 24, 2015 6:53 AM
To: user@cassandra.apache.orgmailto:user@cassandra.apache.org
Subject: Re: Data model suggestions


I think one table, say 'record', should be good. The primary key is the record id. This 
will ensure good distribution.
Just update the active attribute to true or false.
For range queries on active vs. archived records, maintain 2 indexes or try a 
secondary index.
On Apr 23, 2015 1:32 PM, Ali Akhtar 
ali.rac...@gmail.commailto:ali.rac...@gmail.com wrote:
Good point about the range selects. I think they can be made to work with 
limits, though. Or, since the active records will never usually be over 500k, the 
ids may just be cached in memory.

Most of the time, during reads, the queries will just consist of select * where 
primaryKey = someValue . One row at a time.

The question is just whether to keep all records in one table (including 
archived records which won't be queried 99% of the time), or to keep active 
records in their own table and delete them when they're no longer active. Will 
that produce tombstone issues?

On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar 
khangaon...@gmail.commailto:khangaon...@gmail.com wrote:
Hi,
If your external API returns active records, that means (I am guessing) you need 
to do a select * on the active table to figure out which records in the table 
are no longer active.
You might be aware that range selects based on the partition key will time out in 
Cassandra. They can, however, be made to work using the clustering key.
To comment more, we would need to see your proposed Cassandra tables and 
queries that you might need to run.
regards



On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar 
ali.rac...@gmail.commailto:ali.rac...@gmail.com wrote:
That's returned by the external API we're querying. We query them for active 
records; if a previously active record isn't included in the results, that means 
it's time to archive that record.

On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar 
khangaon...@gmail.commailto:khangaon...@gmail.com wrote:
Hi,
How do you determine if the record is no longer active? Is it a periodic 
process that goes through every record and checks when the last update happened?
regards

On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar 
ali.rac...@gmail.commailto:ali.rac...@gmail.com wrote:
Hey all,

We are working on moving a MySQL-based application to Cassandra.

The workflow in MySQL is this: We have two tables: active and archive. Every 
hour, we pull in data from an external API. The records which are active are 
kept in the 'active' table. Once a record is no longer active, it's deleted from 
'active' and re-inserted into 'archive'.

The reason for that is that, most of the time, queries are only done 
against the active records rather than the archived ones. Therefore keeping the active 
table small may help with faster queries, if it only has to search 200k records 
vs 3 million or more.

Is it advisable to keep the same data model in Cassandra? I'm concerned about 

Re: Data model suggestions

2015-04-27 Thread Ali Akhtar
Wouldn't truncating the table create tombstones?

On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded oded.p...@rsa.com wrote:

  I recommend truncating the table instead of dropping it since you don’t
 need to re-issue DDL commands or put load on the system keyspace.

 Both DROP and TRUNCATE automatically create snapshots, so there is no
 “snapshotting” advantage to using DROP. See
 http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot





 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Sunday, April 26, 2015 10:31 PM

 *To:* user@cassandra.apache.org
 *Subject:* Re: Data model suggestions



 Thanks Peer. I like the approach you're suggesting.



 Why do you recommend truncating the last active table rather than just
 dropping it? Since all the data would be inserted into a new table, it seems
 like it would make sense to drop the last table, and that way TRUNCATE's
 snapshotting also won't have to be dealt with (unless I'm missing anything).



 Thanks.





 On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded oded.p...@rsa.com wrote:

 I would maintain two tables.

 An “archive” table that holds all the active and inactive records, and is
 updated hourly (re-inserting the same record has some compaction overhead,
 but on the other hand deleting records has tombstone overhead).

 An “active” table that holds all the records from the last external API
 invocation.

 To avoid tombstones and read-before-delete issues, “active” should actually
 be a synonym, an alias, for the most recent active table.

 I suggest you create two identical tables, “active1” and “active2”, and an
 “active_alias” table that informs which of the two is the most recent.

 Thus when you query the external API you insert the data into “archive” and
 into the unaliased “activeN” table, switch the alias value in “active_alias”,
 and truncate the newly unaliased “activeM” table.

 No need to query the data before inserting it. Make sure truncating
 doesn’t create automatic snapshots.





 *From:* Narendra Sharma [mailto:narendra.sha...@gmail.com]
 *Sent:* Friday, April 24, 2015 6:53 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Data model suggestions



 I think one table, say 'record', should be good. The primary key is the record id.
 This will ensure good distribution.
 Just update the active attribute to true or false.
 For range queries on active vs. archived records, maintain 2 indexes or try a
 secondary index.

 On Apr 23, 2015 1:32 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 Good point about the range selects. I think they can be made to work with
 limits, though. Or, since the active records will never usually be over 500k,
 the ids may just be cached in memory.



 Most of the time, during reads, the queries will just consist of select *
 where primaryKey = someValue . One row at a time.



 The question is just whether to keep all records in one table (including
 archived records which won't be queried 99% of the time), or to keep active
 records in their own table and delete them when they're no longer active.
 Will that produce tombstone issues?



 On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar khangaon...@gmail.com
 wrote:

 Hi,

 If your external API returns active records, that means (I am guessing) you
 need to do a select * on the active table to figure out which records in
 the table are no longer active.

 You might be aware that range selects based on the partition key will time out
 in Cassandra. They can, however, be made to work using the clustering
 key.

 To comment more, we would need to see your proposed Cassandra tables and
 queries that you might need to run.

 regards







 On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar ali.rac...@gmail.com wrote:

 That's returned by the external API we're querying. We query them for
 active records; if a previously active record isn't included in the results,
 that means it's time to archive that record.



 On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar khangaon...@gmail.com
 wrote:

 Hi,

 How do you determine if the record is no longer active? Is it a periodic
 process that goes through every record and checks when the last update
 happened?

 regards



 On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar ali.rac...@gmail.com wrote:

 Hey all,



 We are working on moving a MySQL-based application to Cassandra.



 The workflow in MySQL is this: We have two tables: active and archive.
 Every hour, we pull in data from an external API. The records which are
 active are kept in the 'active' table. Once a record is no longer active, it's
 deleted from 'active' and re-inserted into 'archive'.



 The reason for that is that, most of the time, queries are only done
 against the active records rather than the archived ones. Therefore keeping the
 active table small may help with faster queries, if it only has to search
 200k records vs 3 million or more.



 Is it advisable to keep the same data model in Cassandra? I'm concerned
 

is Thrift support, from Cassandra, really mandatory for OpsCenter monitoring ?

2015-04-27 Thread DE VITO Dominique
Hi,

While reading the OpsCenter 5.1 docs, it looks like OpsCenter can't work if 
Cassandra does not provide a Thrift interface (see [1] below).

Is it really the case?

At first sight, it sounded weird to me, as CQL 3 has been available for months.

Just to know, is a future OpsCenter version that does not rely on a mandatory Thrift 
interface on the roadmap?

Thanks.

Regards,
Dominique

[1] in the OpsCenter 5.1 guide:

*** Modifying how OpsCenter connects to clusters
Cluster Connection settings define how OpsCenter connects to a cluster.

About this task
The Connection settings for a cluster define how OpsCenter connects to the 
cluster. For example, if you've enabled authentication or encryption on a 
cluster, you'll need to specify that information.

Procedure
1. Select the cluster you want to edit from the Cluster menu.
2. Click Settings > Cluster Connections.
The Edit Cluster dialog appears.
3. Change the IP addresses of cluster nodes.
4. Change JMX and Thrift listen port numbers.
5. Edit the user credentials if the JMX or Thrift ports require authentication.




Re: Data model suggestions

2015-04-27 Thread Laing, Michael
No - it immediately removes the sstables on all nodes.

On Mon, Apr 27, 2015 at 7:53 AM, Ali Akhtar ali.rac...@gmail.com wrote:

 Wouldn't truncating the table create tombstones?

 On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded oded.p...@rsa.com wrote:

  I recommend truncating the table instead of dropping it since you don’t
 need to re-issue DDL commands or put load on the system keyspace.

 Both DROP and TRUNCATE automatically create snapshots, so there is no
 “snapshotting” advantage to using DROP. See
 http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot





 *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
 *Sent:* Sunday, April 26, 2015 10:31 PM

 *To:* user@cassandra.apache.org
 *Subject:* Re: Data model suggestions



 Thanks Peer. I like the approach you're suggesting.



 Why do you recommend truncating the last active table rather than just
 dropping it? Since all the data would be inserted into a new table, it seems
 like it would make sense to drop the last table, and that way TRUNCATE's
 snapshotting also won't have to be dealt with (unless I'm missing anything).



 Thanks.





 On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded oded.p...@rsa.com wrote:

 I would maintain two tables.

 An “archive” table that holds all the active and inactive records, and is
 updated hourly (re-inserting the same record has some compaction overhead,
 but on the other hand deleting records has tombstone overhead).

 An “active” table that holds all the records from the last external API
 invocation.

 To avoid tombstones and read-before-delete issues, “active” should
 actually be a synonym, an alias, for the most recent active table.

 I suggest you create two identical tables, “active1” and “active2”, and
 an “active_alias” table that informs which of the two is the most recent.

 Thus when you query the external API you insert the data into “archive” and
 into the unaliased “activeN” table, switch the alias value in “active_alias”,
 and truncate the newly unaliased “activeM” table.

 No need to query the data before inserting it. Make sure truncating
 doesn’t create automatic snapshots.





 *From:* Narendra Sharma [mailto:narendra.sha...@gmail.com]
 *Sent:* Friday, April 24, 2015 6:53 AM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Data model suggestions



 I think one table, say 'record', should be good. The primary key is the record
 id. This will ensure good distribution.
 Just update the active attribute to true or false.
 For range queries on active vs. archived records, maintain 2 indexes or try a
 secondary index.

 On Apr 23, 2015 1:32 PM, Ali Akhtar ali.rac...@gmail.com wrote:

 Good point about the range selects. I think they can be made to work with
 limits, though. Or, since the active records will never usually be over 500k,
 the ids may just be cached in memory.



 Most of the time, during reads, the queries will just consist of select *
 where primaryKey = someValue . One row at a time.



 The question is just whether to keep all records in one table (including
 archived records which won't be queried 99% of the time), or to keep active
 records in their own table and delete them when they're no longer active.
 Will that produce tombstone issues?



 On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar 
 khangaon...@gmail.com wrote:

 Hi,

 If your external API returns active records, that means (I am guessing) you
 need to do a select * on the active table to figure out which records in
 the table are no longer active.

 You might be aware that range selects based on the partition key will time out
 in Cassandra. They can, however, be made to work using the clustering
 key.

 To comment more, we would need to see your proposed Cassandra tables and
 queries that you might need to run.

 regards







 On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar ali.rac...@gmail.com wrote:

 That's returned by the external API we're querying. We query them for
 active records; if a previously active record isn't included in the results,
 that means it's time to archive that record.



 On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar khangaon...@gmail.com
 wrote:

 Hi,

 How do you determine if the record is no longer active? Is it a
 periodic process that goes through every record and checks when the last
 update happened?

 regards



 On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar ali.rac...@gmail.com wrote:

 Hey all,



 We are working on moving a MySQL-based application to Cassandra.



 The workflow in MySQL is this: We have two tables: active and archive.
 Every hour, we pull in data from an external API. The records which are
 active are kept in the 'active' table. Once a record is no longer active, it's
 deleted from 'active' and re-inserted into 'archive'.



 The reason for that is that, most of the time, queries are only done
 against the active records rather than the archived ones. Therefore keeping the
 active table small may help with faster queries, if it 

Re: is Thrift support, from Cassandra, really mandatory for OpsCenter monitoring ?

2015-04-27 Thread Michael Shuler

On 04/27/2015 08:18 AM, DE VITO Dominique wrote:

Just to know, is a future OpsCenter version that does not rely on a mandatory
Thrift interface on the roadmap?


Yes.

--
Kind regards,
Michael



Re: Best Practice to add a node in a Cluster

2015-04-27 Thread Neha Trivedi
Thanks Eric and Matt :) !!

Yes the purpose is to improve reliability.
Right now, from our driver we are querying using degradePolicy for
reliability.



*For changing the keyspace to RF=3, the procedure is as follows:*
1. Add a new node to the cluster (new node is not in seed list)

2. ALTER KEYSPACE system_auth WITH REPLICATION =
  {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};


   1. On each affected node, run nodetool repair
   
http://docs.datastax.com/en/cassandra/1.2/cassandra/tools/toolsNodetool_r.html.

   2. Wait until repair completes on a node, then move to the next node.


Any other things to take care of?

Thanks
Regards
neha


On Mon, Apr 27, 2015 at 9:45 PM, Eric Stevens migh...@gmail.com wrote:

 It depends on why you're adding a new node.  If you're running out of disk
 space or IO capacity in your 2 node cluster, then changing RF to 3 will not
 improve either condition - you'd still be writing all data to all three
 nodes.

 However if you're looking to improve reliability, a 2 node RF=2 cluster
 cannot have either node offline without losing quorum, while a 3 node RF=3
 cluster can have one node offline and still be able to achieve quorum.
 RF=3 is a common replication factor because of this characteristic.

 Make sure your new node is not in its own seeds list, or it will not
 bootstrap (it will come online immediately and start serving requests).

 On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com
 wrote:

 Hi
 We have a 2-node cluster with RF=2. We are planning to add a new node.

 Should we change RF to 3 in the schema?
 Or just add a new node with the same RF=2?

 Any other best practices that we need to take care of?

 Thanks
 regards
 Neha





Re: Best Practice to add a node in a Cluster

2015-04-27 Thread arun sirimalla
Hi Neha,


After you add the node to the cluster, run nodetool cleanup on all nodes.
Next, running repair on each node will replicate the data. Make sure you run
the repair on one node at a time, because repair is an expensive process
(it utilizes high CPU).




On Mon, Apr 27, 2015 at 8:36 PM, Neha Trivedi nehajtriv...@gmail.com
wrote:

 Thanks Eric and Matt :) !!

 Yes the purpose is to improve reliability.
 Right now, from our driver we are querying using degradePolicy for
 reliability.



 *For changing the keyspace to RF=3, the procedure is as follows:*
 1. Add a new node to the cluster (new node is not in seed list)

 2. ALTER KEYSPACE system_auth WITH REPLICATION =
   {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};


1. On each affected node, run nodetool repair

 http://docs.datastax.com/en/cassandra/1.2/cassandra/tools/toolsNodetool_r.html.

2. Wait until repair completes on a node, then move to the next node.


 Any other things to take care of?

 Thanks
 Regards
 neha


 On Mon, Apr 27, 2015 at 9:45 PM, Eric Stevens migh...@gmail.com wrote:

 It depends on why you're adding a new node.  If you're running out of
 disk space or IO capacity in your 2 node cluster, then changing RF to 3
 will not improve either condition - you'd still be writing all data to all
 three nodes.

 However if you're looking to improve reliability, a 2 node RF=2 cluster
 cannot have either node offline without losing quorum, while a 3 node RF=3
 cluster can have one node offline and still be able to achieve quorum.
 RF=3 is a common replication factor because of this characteristic.

 Make sure your new node is not in its own seeds list, or it will not
 bootstrap (it will come online immediately and start serving requests).

 On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com
 wrote:

 Hi
 We have a 2-node cluster with RF=2. We are planning to add a new node.

 Should we change RF to 3 in the schema?
 Or just add a new node with the same RF=2?

 Any other best practices that we need to take care of?

 Thanks
 regards
 Neha






-- 
Arun
Senior Hadoop/Cassandra Engineer
Cloudwick

Champion of Big Data (Cloudera)
http://www.cloudera.com/content/dev-center/en/home/champions-of-big-data.html

2014 Data Impact Award Winner (Cloudera)
http://www.cloudera.com/content/cloudera/en/campaign/data-impact-awards.html


minimum bandwidth requirement between two Geo Redundant sites of Cassandra database

2015-04-27 Thread Gaurav Bhatnagar
Hi,
 Is there any minimum bandwidth requirement between two geo-redundant
data centres?
What latency should the link between two geo-redundant data
centres have to get the most efficient operation?

Regards,
Gaurav


New node got stuck joining the cluster after a while

2015-04-27 Thread Analia Lorenzatto
Hello guys,

I have a cluster comprised of 2 nodes, configured with vnodes, using
Cassandra version 2.1.0-2.

And I am facing an issue when I want to join a new node to the cluster.

At first it started joining, but then it got stuck:

UN  1x.x.x.x  348.11 GB  256 100.0%  1c
UN  1x.x.x.x  342.74 GB  256 100.0%  1c
UJ  1x.x.x.x  26.86 GB   256 ?   1c


I can see some errors on the already working nodes:

*WARN  [SharedPool-Worker-7] 2015-04-27 17:41:16,060
SliceQueryFilter.java:236 - Read 5001 live and 66548 tombstoned cells in
usmc.userpixel (see tombstone_warn_threshol*
*d). 5000 columns was requested, slices=[-],
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647
2147483647}*
*WARN  [SharedPool-Worker-32] 2015-04-27 17:41:16,668
SliceQueryFilter.java:236 - Read 2012 live and 30440 tombstoned cells in
usmc.userpixel (see tombstone_warn_thresho*
*ld). 5001 columns was requested,
slices=[b6d051df-0a8f-4c13-b93c-1b4ff0d82b8d:date-],
delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}*

*ERROR [CompactionExecutor:35638] 2015-04-27 19:06:07,613
CassandraDaemon.java:166 - Exception in thread
Thread[CompactionExecutor:35638,1,main]*
*java.lang.AssertionError: Memory was freed*
*at
org.apache.cassandra.io.util.Memory.checkPosition(Memory.java:281)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at org.apache.cassandra.io.util.Memory.getInt(Memory.java:233)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.io.sstable.IndexSummary.getPositionInSummary(IndexSummary.java:118)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.io.sstable.IndexSummary.getKey(IndexSummary.java:123)
~[apache-cassandra-2.1.0.jar:2.1.0]*
* at
org.apache.cassandra.io.sstable.IndexSummary.binarySearch(IndexSummary.java:92)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.io.sstable.SSTableReader.getSampleIndexesForRanges(SSTableReader.java:1209)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.io.sstable.SSTableReader.estimatedKeysForRanges(SSTableReader.java:1165)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.db.compaction.AbstractCompactionStrategy.worthDroppingTombstones(AbstractCompactionStrategy.java:328)
~[apache-cassandra-2.1.0.jar:2.1.0*
*]*
*at
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.findDroppableSSTable(LeveledCompactionStrategy.java:365)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:127)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:112)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:229)
~[apache-cassandra-2.1.0.jar:2.1.0]*
*at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
~[na:1.7.0_51]*
*at java.util.concurrent.FutureTask.run(FutureTask.java:262)
~[na:1.7.0_51]*
*at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
~[na:1.7.0_51]*
*at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_51]*
*at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]*

But I do not see any warning or error message in the logs of the joining
node.  I just see an exception there when I run nodetool info:

root@:~# nodetool info
ID   : f5e49647-59fa-474f-b6af-9f65abc43581
Gossip active: true
Thrift active: false
Native Transport active: false
Load : 26.86 GB
Generation No: 1430163258
Uptime (seconds) : 18799
Heap Memory (MB) : 4185.15 / 7566.00
error: null
-- StackTrace --
java.lang.AssertionError
at
org.apache.cassandra.locator.TokenMetadata.getTokens(TokenMetadata.java:440)
at
org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2079)
at
org.apache.cassandra.service.StorageService.getTokens(StorageService.java:2068)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:75)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:279)
at
com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at

Re: Best Practice to add a node in a Cluster

2015-04-27 Thread Neha Trivedi
Thanks Arun!

On Tue, Apr 28, 2015 at 9:44 AM, arun sirimalla arunsi...@gmail.com wrote:

 Hi Neha,


  After you add the node to the cluster, run nodetool cleanup on all nodes.
 Next, running repair on each node will replicate the data. Make sure you
 run the repair on one node at a time, because repair is an expensive
 process (it utilizes high CPU).




 On Mon, Apr 27, 2015 at 8:36 PM, Neha Trivedi nehajtriv...@gmail.com
 wrote:

 Thanks Eric and Matt :) !!

 Yes the purpose is to improve reliability.
 Right now, from our driver we are querying using degradePolicy for
 reliability.



 *For changing the keyspace to RF=3, the procedure is as follows:*
 1. Add a new node to the cluster (new node is not in seed list)

 2. ALTER KEYSPACE system_auth WITH REPLICATION =
   {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};


1. On each affected node, run nodetool repair

 http://docs.datastax.com/en/cassandra/1.2/cassandra/tools/toolsNodetool_r.html.

2. Wait until repair completes on a node, then move to the next node.


 Any other things to take care of?

 Thanks
 Regards
 neha


 On Mon, Apr 27, 2015 at 9:45 PM, Eric Stevens migh...@gmail.com wrote:

 It depends on why you're adding a new node.  If you're running out of
 disk space or IO capacity in your 2 node cluster, then changing RF to 3
 will not improve either condition - you'd still be writing all data to all
 three nodes.

 However if you're looking to improve reliability, a 2 node RF=2 cluster
 cannot have either node offline without losing quorum, while a 3 node RF=3
 cluster can have one node offline and still be able to achieve quorum.
 RF=3 is a common replication factor because of this characteristic.

 Make sure your new node is not in its own seeds list, or it will not
 bootstrap (it will come online immediately and start serving requests).

 On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com
 wrote:

 Hi
 We have a 2-node cluster with RF=2. We are planning to add a new node.

 Should we change RF to 3 in the schema?
 Or just add a new node with the same RF=2?

 Any other best practices that we need to take care of?

 Thanks
 regards
 Neha






 --
 Arun
 Senior Hadoop/Cassandra Engineer
 Cloudwick

 Champion of Big Data (Cloudera)

 http://www.cloudera.com/content/dev-center/en/home/champions-of-big-data.html

 2014 Data Impact Award Winner (Cloudera)

 http://www.cloudera.com/content/cloudera/en/campaign/data-impact-awards.html




Best Practice to add a node in a Cluster

2015-04-27 Thread Neha Trivedi
Hi
We have a 2-node cluster with RF=2. We are planning to add a new node.

Should we change RF to 3 in the schema?
Or just add a new node with the same RF=2?

Any other best practices that we need to take care of?

Thanks
regards
Neha


RE: Best Practice to add a node in a Cluster

2015-04-27 Thread Matthew Johnson
Hi Neha,



I guess it depends why you are adding a new node – do you need more storage
capacity, do you want better resilience, or are you trying to increase
performance?



If you add a new node with the same amount of storage as the previous two,
but you increase the RF, you will use up all of the storage you have added
by replicating the existing data onto the new node. If you keep it at RF=2,
once you have done all the bootstrapping and cleanup then your usage on the
existing two should decrease by about 30% of their total size (with RF=2 spread
over three nodes, each node holds roughly two thirds of the data instead of all
of it).



However, if it is resilience you are after (being able to take down nodes
without losing availability) then increasing the RF will give you this, at
the expense of using more storage.



Hope that helps.



Cheers,

Matt





*From:* Neha Trivedi [mailto:nehajtriv...@gmail.com]
*Sent:* 27 April 2015 16:46
*To:* user@cassandra.apache.org
*Subject:* Best Practice to add a node in a Cluster



Hi

We have a 2-node cluster with RF=2. We are planning to add a new node.

Should we change RF to 3 in the schema?
Or just add a new node with the same RF=2?

Any other best practices that we need to take care of?

Thanks

regards

Neha


Re: Best Practice to add a node in a Cluster

2015-04-27 Thread Eric Stevens
It depends on why you're adding a new node.  If you're running out of disk
space or IO capacity in your 2 node cluster, then changing RF to 3 will not
improve either condition - you'd still be writing all data to all three
nodes.

However if you're looking to improve reliability, a 2 node RF=2 cluster
cannot have either node offline without losing quorum, while a 3 node RF=3
cluster can have one node offline and still be able to achieve quorum.
RF=3 is a common replication factor because of this characteristic.

Make sure your new node is not in its own seeds list, or it will not
bootstrap (it will come online immediately and start serving requests).

On Mon, Apr 27, 2015 at 8:46 AM, Neha Trivedi nehajtriv...@gmail.com
wrote:

 Hi
 We have a 2-node cluster with RF=2. We are planning to add a new node.

 Should we change RF to 3 in the schema?
 Or just add a new node with the same RF=2?

 Any other best practices that we need to take care of?

 Thanks
 regards
 Neha




Fwd: Data Modelling Help

2015-04-27 Thread Sandeep Gupta
Hi,

I am a newbie with Cassandra and thus need data modelling help as I haven't
found a resource that tackles the same problem.

The use case is similar to an email system. I want to store a timeline of
all emails a user has received and then fetch them back in three
different ways:

1. All emails ever received
2. Mails that have been read by a user
3. Mails that are still unread by a user

My current model is as under:

CREATE TABLE TIMELINE (
userID varchar,
emailID varchar,
timestamp bigint,
read boolean,
PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp desc);

CREATE INDEX ON TIMELINE (read);

The queries I need to support are:

SELECT * FROM TIMELINE where userID = 12;
SELECT * FROM TIMELINE where userID = 12 order by timestamp asc;
SELECT * FROM TIMELINE where userID = 12 and read = true;
SELECT * FROM TIMELINE where userID = 12 and read = false;
SELECT * FROM TIMELINE where userID = 12 and read = true order by timestamp
asc;
SELECT * FROM TIMELINE where userID = 12 and read = false order by
timestamp asc;


*Queries are:*

1. Should I keep 'read' as my secondary index, as it will be frequently
updated and can create tombstones - per
http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html it's a
problem.

2. Can we do an inequality check on a secondary index? Because I found out that
at least one equality condition should be present on the secondary index.

3. If this is not the right way to model, please suggest how to support
the above queries. Maintaining three different tables worries me because of the
number of insertions (for read/unread), as the number of users * emails viewed
per day will be huge.


Thanks in advance.

Best Regards!





Keep Walking,
~ Sandeep


Datastax EC2 Ami

2015-04-27 Thread Eduardo Cusa
Hi Guys, we started our Cassandra cluster with the following AMI:


ami-ada2b6c4

https://console.aws.amazon.com/ec2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-ada2b6c4


Now we need to add a new node and we realized this AMI has Cassandra 2.1.4
instead of 2.1.0-2.

Is it safe to join this node to the cluster? Or do we need to downgrade on
the new node?

Regards
Eduardo


Re: Never dropped droppable tombstoned data with LCS

2015-04-27 Thread Robert Coli
On Sun, Apr 26, 2015 at 1:50 PM, Safa Topal safa.to...@gmail.com wrote:

 We have a 3-node cluster with Cassandra version 2.0.8. I am seeing data
 that should already have been dropped. In JMX, I can see that
 DroppableTombstoneRatio is 0.68 for the column family and the
 tombstone_threshold was left at its default for the CF. We are using LCS on
 the related CF, and the replication factor of the keyspace is set to 3.


2.0.8 contains significant bugs; I would upgrade to the HEAD of 2.0.x ASAP.

Regarding non-dropped data :

https://issues.apache.org/jira/browse/CASSANDRA-6654

?


 We have experienced some downtimes because of repair and for that
 reason, we are reluctant to run repair again.


Consider increasing your gc_grace_seconds to 34 days and running repair
once a month, on the first of the month, until you resolve the issue.
Not running repair on a regular schedule will be fatal to the consistency of
some data.
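
For example, 34 days is 2,937,600 seconds; a minimal sketch of that change on a
hypothetical table:

-- 34 days * 86,400 seconds/day = 2,937,600 seconds
ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 2937600;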


 How can we get rid of this tombstoned data without running repair? Or is
 there any other way to run repair without exhausting the cluster? I have
 seen a lot about repair -pr; however, I am not sure if it will be suitable
 for our case.


Repair -pr should always be used when you are repairing your entire
cluster; that's what it's for.

How is repair related to the non-purged data? Repair that kills you with
tombstones will probably also kill you without tombstones?

=Rob