org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:218) throws java.lang.AssertionError

2015-11-09 Thread 李建奇
Hi, All,

 

  We have been running a 12-node cluster on version 2.1.9 for nearly a month. Last
week it hit an exception. After the exception, the cluster's write and read
latency climbed from an average of 0.4 ms to about 4 seconds.

  I suspect an OutboundTcpConnection is broken. I tried disablegossip followed by
enablegossip to recover the OutboundTcpConnection, but that did not help.

  I had to restart every node to restore the cluster to normal. The cluster's
load is low. The client uses the DataStax Java driver 2.1.7.1. The CF has RF 3.

  Question:

  Which situations can trigger this AssertionError? I read the
OutboundTcpConnection.java source code; the comment at line 228 says
"writeConnected() is reasonably robust".

  

Thanks

Attachment:

ERROR [MessagingService-Outgoing-/172.20.114.13] 2015-11-08 10:36:28,763 CassandraDaemon.java:223 - Exception in thread Thread[MessagingService-Outgoing-/172.20.114.13,5,main]
java.lang.AssertionError: 78251
    at org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:290) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:392) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:381) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:271) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:259) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:503) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:490) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.SliceFromReadCommandSerializer.serialize(SliceFromReadCommand.java:168) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:143) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:132) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:121) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:330) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:282) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:218) ~[apache-cassandra-2.1.9.jar:2.1.9]

ERROR [MessagingService-Outgoing-/172.20.114.19] 2015-11-08 10:36:28,763 CassandraDaemon.java:223 - Exception in thread Thread[MessagingService-Outgoing-/172.20.114.19,5,main]
java.lang.AssertionError: 78251
    at org.apache.cassandra.utils.ByteBufferUtil.writeWithShortLength(ByteBufferUtil.java:290) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:392) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.composites.AbstractCType$Serializer.serialize(AbstractCType.java:381) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:271) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.ColumnSlice$Serializer.serialize(ColumnSlice.java:259) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:503) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.filter.SliceQueryFilter$Serializer.serialize(SliceQueryFilter.java:490) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.SliceFromReadCommandSerializer.serialize(SliceFromReadCommand.java:168) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:143) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.db.ReadCommandSerializer.serialize(ReadCommand.java:132) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.MessageOut.serialize(MessageOut.java:121) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.writeInternal(OutboundTcpConnection.java:330) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:282) ~[apache-cassandra-2.1.9.jar:2.1.9]
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:218) ~[apache-cassandra-2.1.9.jar:2.1.9]

INFO  [NativePoolCleaner] 2015-11-08 10:36:47,414 ColumnFamilyStore.java:1231 - Flushing largest CFS(Keyspace='qinglong', ColumnFamily='package_state') to free up 

[RELEASE] Apache Cassandra 3.0.0 released

2015-11-09 Thread Jake Luciani
The Cassandra team is pleased to announce the release of Apache Cassandra
version 3.0.0.

Top Cassandra 3.0 features:

  * CQL optimized storage engine and sstable format
  * Materialized views
  * More efficient hints

Read more about features and upgrade instructions in NEWS.txt[2]

The Java driver beta for 3.0.0 will be officially released within the next
week.  In the meantime,
use the version included in the release under /lib.

The Python driver rc has been released as '3.0.0rc1'

Apache Cassandra is a fully distributed database. It is the right choice
when you need scalability and high availability without compromising
performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download
section:

 http://cassandra.apache.org/download/

This version is the first release[1] in the 3.0 series. As always, please pay
attention to the release notes[2] and let us know[3] if you encounter any
problems.

Enjoy!

[1]: http://goo.gl/TduZdw (CHANGES.txt)
[2]: http://goo.gl/mJxdHZ (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA


Re: Unable to bootstrap another DC in my cluster

2015-11-09 Thread Robert Coli
On Mon, Nov 9, 2015 at 12:08 PM, K F  wrote:

> I am trying to bring up a new DC in my cluster. The first seed node that I
> bring up in the new DC (which I am adding to the existing cluster) is not
> able to receive a reply to the GossipDigestSyn requests it sends to the
> other seeds in the cluster.
>
> This causes the first node to come up as a standalone node in the cluster.
> So, how do I debug this situation further?
>

That shouldn't be the case in 2.0.14, IIRC. It should just fail to start if
you have provided a seed that it cannot contact.

Are you sure the node is not in its own seed list?

https://issues.apache.org/jira/browse/CASSANDRA-5836

=Rob


Re: Cassandra compaction stuck? Should I disable?

2015-11-09 Thread Robert Coli
On Mon, Nov 9, 2015 at 1:29 PM, PenguinWhispererThe . <
th3penguinwhispe...@gmail.com> wrote:
>
> In Opscenter I see one of the nodes is orange. It seems like it's working
> on compaction. I used nodetool compactionstats and whenever I did this the
> Completed value and percentage stay the same (even with hours in between).
>
Are you the same person from IRC, or a second report today of compaction
hanging in this way?

What version of Cassandra?

> I currently don't see cpu load from cassandra on that node. So it seems
> stuck (somewhere mid 60%). Also some other nodes have compaction on the
> same columnfamily. I don't see any progress.
>
>  WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677 
> ColumnFamilyStore.java (line 2101) Unable to cancel in-progress compactions 
> for usage_record_ptd.  Probably there is an unusually large row in progress 
> somewhere.  It is also possible that buggy code left some sstables compacting 
> after it was done with them
>
>
>- How can I assure that nothing is happening?
>
Find the thread that is doing compaction and strace it. Generally it is
one of the threads with a lower thread priority.

Compaction often appears hung when decompressing a very large row, but
usually not for "hours".

>
>- Is it recommended to disable compaction from a certain data size? (I
>believe 25GB on each node).
>
It is almost never recommended to disable compaction.

>
>- Can I stop this compaction? nodetool stop compaction doesn't seem to
>work.
>
Killing the JVM ("the dungeon collapses!") would certainly stop it, but
it'd likely just start again when you restart the node.

>
>- Is stopping the compaction dangerous?
>
Not if you're in a version that properly cleans up partial compactions,
which is most of them.

>
>- Is killing the cassandra process dangerous while compacting(I did
>nodetool drain on one node)?
>
No. But probably nodetool drain couldn't actually stop the in-progress
compaction either, FWIW.

> This is output of nodetool compactionstats grepped for the keyspace that
> seems stuck.
>
Do you have gigantic rows in that keyspace? What does cfstats say about
the largest row compaction has seen/do you have log messages about
compacting large rows?

> I also see frequently lines like this in system.log:
>
> WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886 
> BatchStatement.java (line 223) Batch of prepared statements for 
> [billing.usage_record_by_billing_period, billing.metric] is of size 53086, 
> exceeding specified threshold of 5120 by 47966.
>
>
Unrelated.

=Rob


Fwd: Cassandra compaction stuck? Should I disable?

2015-11-09 Thread PenguinWhispererThe .
Hi all,

In Opscenter I see one of the nodes is orange. It seems like it's working
on compaction. I used nodetool compactionstats, and whenever I did this the
Completed value and percentage stayed the same (even with hours in between). I
currently don't see CPU load from Cassandra on that node, so it seems stuck
(somewhere in the mid 60% range). Also, some other nodes have compactions on
the same column family, and I don't see any progress.

 WARN [RMI TCP Connection(554)-192.168.0.68] 2015-11-09 17:18:13,677
ColumnFamilyStore.java (line 2101) Unable to cancel in-progress
compactions for usage_record_ptd.  Probably there is an unusually
large row in progress somewhere.  It is also possible that buggy code
left some sstables compacting after it was done with them


   - How can I assure that nothing is happening?
   - Is it recommended to disable compaction from a certain data size? (I
   believe 25GB on each node).
   - Can I stop this compaction? nodetool stop compaction doesn't seem to
   work.
   - Is stopping the compaction dangerous?
   - Is killing the cassandra process dangerous while compacting(I did
   nodetool drain on one node)?


This is output of nodetool compactionstats grepped for the keyspace that
seems stuck.

4e48f940-86c6-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447062197972  52321301  16743606  {1:2, 4:248}
94acec50-86c8-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447063175061  48992375  13420862  {3:3, 4:245}
3210c9b0-8707-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447090067915  52763216  17732003  {1:2, 4:248}
24f96fe0-86ce-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447065564638  44909171  17029440  {1:2, 3:39, 4:209}
06d58370-86ef-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447079687463  53570365  17873962  {1:2, 3:2, 4:246}
f7aa5fa0-86c7-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447062911642  47701016  13291915  {3:2, 4:246}
806a4380-86f7-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447083327416  52644411  17363023  {1:2, 2:1, 4:247}
c845b900-86c5-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447061973136  48944530  16698191  {1:2, 3:6, 4:242}
bb44a0b0-8718-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447097599547  48768463  13518523  {2:2, 3:5, 4:242}
f2c17ea0-86c3-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447061185418  90367799  13904914  {5:4, 6:7, 7:52, 8:185}
1aae6590-86ce-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447065547369  53190698  17228121  {1:2, 4:248}
d7ca8d00-86d5-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447068871120  52422499  16995963  {1:2, 3:3, 4:245}
6e890290-86df-11e5-96be-dd3c9e46ec74  mykeyspace  mycolumnfamily  1447072989497  45218168  17174468  {1:2, 3:21, 4:227}

I also see frequently lines like this in system.log:

WARN [Native-Transport-Requests:11935] 2015-11-09 20:10:41,886
BatchStatement.java (line 223) Batch of prepared statements for
[billing.usage_record_by_billing_period, billing.metric] is of size
53086, exceeding specified threshold of 5120 by 47966.


Any other remarks? Thanks a lot in advance!


Unable to bootstrap another DC in my cluster

2015-11-09 Thread K F
Hi folks,
I am trying to bring up a new DC in my cluster. The first seed node that I
bring up in the new DC (which I am adding to the existing cluster) is not able
to receive a reply to the GossipDigestSyn requests it sends to the other seeds
in the cluster.
This causes the first node to come up as a standalone node in the cluster.
So, how do I debug this situation further?
I am running Cassandra version 2.0.14. Is this a known issue with this
release?
Thanks.

Re: How to organize a timeseries by device?

2015-11-09 Thread Kai Wang
The bucket key is just like any other column of the table; you can use any type,
as long as it's convenient for you to write the query.

But I don't think you should use 5 minutes as your bucket size, since you only
have one event every 5 minutes; a 5-minute bucket seems too small. The bucket
key we mentioned is for breaking the (device_id, timestamp) partitions into
ones between ~1 MB and ~10 MB in size.
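
For example, a text bucket with a date format could look like the sketch below
(a minimal illustration only; the table name, column names, and the month-sized
bucket are assumptions made for this example, not taken from the thread):

  CREATE TABLE events_by_device (
      device_id  uuid,
      bucket     text,        -- e.g. '2015-11': one month per device per partition
      event_time timestamp,
      latitude   double,
      longitude  double,
      PRIMARY KEY ((device_id, bucket), event_time)
  );

A timestamp (or date) column truncated to the start of the bucket would work
just as well; the text form is simply easy to compute on the client and easy to
read in queries.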

On Mon, Nov 9, 2015 at 11:50 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> Is it usually recommended to use the bucket key (usually an 5 minutes
> period in my case) for the table of the events_by_time using a timestamp or
> a string?
>
> On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang  wrote:
>
>> it depends on the size of each event. You want to bound each partition
>> under ~10MB. In system.log look for entry like:
>>
>> WARN  [CompactionExecutor:39] 2015-11-07 17:32:00,019
>> SSTableWriter.java:240 - Compacting large partition
>> :9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>>
>> This is the warning sign that you have large partitions. The threshold is
>> defined by compaction_large_partition_warning_threshold_mb in
>> cassandra.yaml. The default is 100MB.
>>
>> You can also use nodetool cfstats to check partition size.
>>
>> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> For the first table: (device_id, timestamp), should I add a bucket even
>>> if I know I might have millions of events per device but never billions?
>>>
>>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky >> > wrote:
>>>
 Cassandra is good at two kinds of queries: 1) access a specific row by
 a specific key, and 2) Access a slice or consecutive sequence of rows
 within a given partition.

 It is recommended to avoid ALLOW FILTERING. If it happens to work well
 for you, great, go for it, but if it doesn't then simply don't do it. Best
 to redesign your data model to play to Cassandra's strengths.

 If you bucket the time-based table, do a separate query for each time
 bucket.

 -- Jack Krupansky

 On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
 guilla...@databerries.com> wrote:

> Kai, Jack,
>
> On 1., should the bucket be a STRING with a date format or do I have a
> better option ? For (device_id, bucket, timestamp), did you mean
> ((device_id, bucket), timestamp) ?
>
> On 2., what are the risks of timeout ? I currently have this warning:
> "Cannot execute this query as it might involve data filtering and thus may
> have unpredictable performance. If you want to execute this query despite
> the performance unpredictability, use ALLOW FILTERING".
>
> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:
>
>> 1. Don't make your partition unbound. It's tempting to just use
>> (device_id, timestamp). But soon or later you will have problem when time
>> goes by. You can keep the partition bound by using (device_id, bucket,
>> timestamp). Use hour, day, month or even year like Jack mentioned 
>> depending
>> on the size of data.
>>
>> 2. As to your specific query, for a given partition and a time range,
>> C* doesn't need to load the whole partition then filter. It only 
>> retrieves
>> the slice within the time range from disk because the data is clustered 
>> by
>> timestamp.
>>
>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
>> jack.krupan...@gmail.com> wrote:
>>
>>> The general rule in Cassandra data modeling is to look at all of
>>> your queries first and then to declare a table for each query, even if 
>>> that
>>> means storing multiple copies of the data. So, create a second table 
>>> with
>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>> interval makes sense to give 1 to 10 megabytes per partition) and time 
>>> and
>>> device as the clustering keys.
>>>
>>> Or, consider DSE SEarch  and then you can do whatever ad hoc queries
>>> you want using Solr. Or Stratio or TupleJump Stargate for an open source
>>> Lucene plugin.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>> guilla...@databerries.com> wrote:
>>>
 Hello,

 We are currently storing geolocation events (about 1 per 5 minutes)
 for each device we track. We currently have 2 TB of data. I would like 
 to
 store the device_id, the timestamp of the event, latitude and 
 longitude. I
 though about using the device_id as the partition key and timestamp as 
 the
 clustering column. It is great as events are naturally grouped by 
 device
 (very useful for our Spark jobs). However, if I would like to retrieve 
 all
 events of all devices of the last week I understood that Cassandra will

Re: Do I have to use the cql in the datastax java driver?

2015-11-09 Thread Robert Coli
On Sun, Nov 8, 2015 at 6:57 AM, Jonathan Haddad  wrote:

> You shouldn't use thrift, it's effectively dead.
>


> On Fri, Nov 6, 2015 at 10:30 PM Dikang Gu  wrote:
>
>> Can I still use thrift interface to talk to cassandra? Any reason that we
>> should not use thrift anymore?
>>
>
I agree with Jonathan.

In my opinion, Thrift is highly likely to eventually be removed from
Cassandra. I recommend that operators of new projects not use it.

=Rob


Re: Best way to recreate a cassandra node with data

2015-11-09 Thread Robert Coli
On Sun, Nov 8, 2015 at 9:11 PM, John Wong  wrote:

> If we recreate an instance with the same IP, what is the best way to get
> the node up and running with the previous data? Right now I am relying on
> backup.
>

replace_address if you don't mind decreasing unique replica count by one.

If you do care about decreasing unique replica count by one, restore the node's
data from a backup taken with something like tablesnap:

https://github.com/JeremyGrosser/tablesnap


> I was hoping that we can stream the data, but nodetool rebuild is for
> bringing up a new data center. I just re-created the instance and I don't
> see much going on except keyspaces are being re-created with some data
> file. I thought Cassandra would automatically stream data from replicas...
>

Note that:

1) in order for replace_address to work, you need to set initial_token in
the conf file
2) the node can't be in its own seed list, or it can't bootstrap

=Rob


Re: How to organize a timeseries by device?

2015-11-09 Thread Guillaume Charhon
For the events_by_time table, is it usually recommended to make the bucket
key (a 5-minute period in my case) a timestamp or a string?

On Mon, Nov 9, 2015 at 5:05 PM, Kai Wang  wrote:

> it depends on the size of each event. You want to bound each partition
> under ~10MB. In system.log look for entry like:
>
> WARN  [CompactionExecutor:39] 2015-11-07 17:32:00,019
> SSTableWriter.java:240 - Compacting large partition
> :9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)
>
> This is the warning sign that you have large partitions. The threshold is
> defined by compaction_large_partition_warning_threshold_mb in
> cassandra.yaml. The default is 100MB.
>
> You can also use nodetool cfstats to check partition size.
>
> On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> For the first table: (device_id, timestamp), should I add a bucket even
>> if I know I might have millions of events per device but never billions?
>>
>> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky 
>> wrote:
>>
>>> Cassandra is good at two kinds of queries: 1) access a specific row by a
>>> specific key, and 2) Access a slice or consecutive sequence of rows within
>>> a given partition.
>>>
>>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>>> for you, great, go for it, but if it doesn't then simply don't do it. Best
>>> to redesign your data model to play to Cassandra's strengths.
>>>
>>> If you bucket the time-based table, do a separate query for each time
>>> bucket.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>>> guilla...@databerries.com> wrote:
>>>
 Kai, Jack,

 On 1., should the bucket be a STRING with a date format or do I have a
 better option ? For (device_id, bucket, timestamp), did you mean
 ((device_id, bucket), timestamp) ?

 On 2., what are the risks of timeout ? I currently have this warning:
 "Cannot execute this query as it might involve data filtering and thus may
 have unpredictable performance. If you want to execute this query despite
 the performance unpredictability, use ALLOW FILTERING".

 On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:

> 1. Don't make your partition unbound. It's tempting to just use
> (device_id, timestamp). But soon or later you will have problem when time
> goes by. You can keep the partition bound by using (device_id, bucket,
> timestamp). Use hour, day, month or even year like Jack mentioned 
> depending
> on the size of data.
>
> 2. As to your specific query, for a given partition and a time range,
> C* doesn't need to load the whole partition then filter. It only retrieves
> the slice within the time range from disk because the data is clustered by
> timestamp.
>
> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>> The general rule in Cassandra data modeling is to look at all of your
>> queries first and then to declare a table for each query, even if that
>> means storing multiple copies of the data. So, create a second table with
>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>> interval makes sense to give 1 to 10 megabytes per partition) and time 
>> and
>> device as the clustering keys.
>>
>> Or, consider DSE SEarch  and then you can do whatever ad hoc queries
>> you want using Solr. Or Stratio or TupleJump Stargate for an open source
>> Lucene plugin.
>>
>> -- Jack Krupansky
>>
>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> Hello,
>>>
>>> We are currently storing geolocation events (about 1 per 5 minutes)
>>> for each device we track. We currently have 2 TB of data. I would like 
>>> to
>>> store the device_id, the timestamp of the event, latitude and 
>>> longitude. I
>>> though about using the device_id as the partition key and timestamp as 
>>> the
>>> clustering column. It is great as events are naturally grouped by device
>>> (very useful for our Spark jobs). However, if I would like to retrieve 
>>> all
>>> events of all devices of the last week I understood that Cassandra will
>>> need to load all data and filter which does not seems to be clean on the
>>> long term.
>>>
>>> How should I create my model?
>>>
>>> Best Regards
>>>
>>
>>
>

>>>
>>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Kai Wang
It depends on the size of each event. You want to bound each partition
under ~10 MB. In system.log, look for entries like:

WARN  [CompactionExecutor:39] 2015-11-07 17:32:00,019
SSTableWriter.java:240 - Compacting large partition
:9f80ce31-b7e7-40c7-b642-f5d03fc320aa (13443863224 bytes)

This is the warning sign that you have large partitions. The threshold is
defined by compaction_large_partition_warning_threshold_mb in
cassandra.yaml. The default is 100MB.

You can also use nodetool cfstats to check partition size.

On Mon, Nov 9, 2015 at 10:53 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> For the first table: (device_id, timestamp), should I add a bucket even
> if I know I might have millions of events per device but never billions?
>
> On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky 
> wrote:
>
>> Cassandra is good at two kinds of queries: 1) access a specific row by a
>> specific key, and 2) Access a slice or consecutive sequence of rows within
>> a given partition.
>>
>> It is recommended to avoid ALLOW FILTERING. If it happens to work well
>> for you, great, go for it, but if it doesn't then simply don't do it. Best
>> to redesign your data model to play to Cassandra's strengths.
>>
>> If you bucket the time-based table, do a separate query for each time
>> bucket.
>>
>> -- Jack Krupansky
>>
>> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> Kai, Jack,
>>>
>>> On 1., should the bucket be a STRING with a date format or do I have a
>>> better option ? For (device_id, bucket, timestamp), did you mean
>>> ((device_id, bucket), timestamp) ?
>>>
>>> On 2., what are the risks of timeout ? I currently have this warning:
>>> "Cannot execute this query as it might involve data filtering and thus may
>>> have unpredictable performance. If you want to execute this query despite
>>> the performance unpredictability, use ALLOW FILTERING".
>>>
>>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:
>>>
 1. Don't make your partition unbound. It's tempting to just use
 (device_id, timestamp). But soon or later you will have problem when time
 goes by. You can keep the partition bound by using (device_id, bucket,
 timestamp). Use hour, day, month or even year like Jack mentioned depending
 on the size of data.

 2. As to your specific query, for a given partition and a time range,
 C* doesn't need to load the whole partition then filter. It only retrieves
 the slice within the time range from disk because the data is clustered by
 timestamp.

 On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <
 jack.krupan...@gmail.com> wrote:

> The general rule in Cassandra data modeling is to look at all of your
> queries first and then to declare a table for each query, even if that
> means storing multiple copies of the data. So, create a second table with
> bucketed time as the partition key (hour, 15 minutes, or whatever time
> interval makes sense to give 1 to 10 megabytes per partition) and time and
> device as the clustering keys.
>
> Or, consider DSE SEarch  and then you can do whatever ad hoc queries
> you want using Solr. Or Stratio or TupleJump Stargate for an open source
> Lucene plugin.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Hello,
>>
>> We are currently storing geolocation events (about 1 per 5 minutes)
>> for each device we track. We currently have 2 TB of data. I would like to
>> store the device_id, the timestamp of the event, latitude and longitude. 
>> I
>> though about using the device_id as the partition key and timestamp as 
>> the
>> clustering column. It is great as events are naturally grouped by device
>> (very useful for our Spark jobs). However, if I would like to retrieve 
>> all
>> events of all devices of the last week I understood that Cassandra will
>> need to load all data and filter which does not seems to be clean on the
>> long term.
>>
>> How should I create my model?
>>
>> Best Regards
>>
>
>

>>>
>>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Guillaume Charhon
For the first table: (device_id, timestamp), should I add a bucket even if
I know I might have millions of events per device but never billions?

On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky 
wrote:

> Cassandra is good at two kinds of queries: 1) access a specific row by a
> specific key, and 2) Access a slice or consecutive sequence of rows within
> a given partition.
>
> It is recommended to avoid ALLOW FILTERING. If it happens to work well for
> you, great, go for it, but if it doesn't then simply don't do it. Best to
> redesign your data model to play to Cassandra's strengths.
>
> If you bucket the time-based table, do a separate query for each time
> bucket.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Kai, Jack,
>>
>> On 1., should the bucket be a STRING with a date format or do I have a
>> better option ? For (device_id, bucket, timestamp), did you mean
>> ((device_id, bucket), timestamp) ?
>>
>> On 2., what are the risks of timeout ? I currently have this warning:
>> "Cannot execute this query as it might involve data filtering and thus may
>> have unpredictable performance. If you want to execute this query despite
>> the performance unpredictability, use ALLOW FILTERING".
>>
>> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:
>>
>>> 1. Don't make your partition unbound. It's tempting to just use
>>> (device_id, timestamp). But soon or later you will have problem when time
>>> goes by. You can keep the partition bound by using (device_id, bucket,
>>> timestamp). Use hour, day, month or even year like Jack mentioned depending
>>> on the size of data.
>>>
>>> 2. As to your specific query, for a given partition and a time range, C*
>>> doesn't need to load the whole partition then filter. It only retrieves the
>>> slice within the time range from disk because the data is clustered by
>>> timestamp.
>>>
>>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky >> > wrote:
>>>
 The general rule in Cassandra data modeling is to look at all of your
 queries first and then to declare a table for each query, even if that
 means storing multiple copies of the data. So, create a second table with
 bucketed time as the partition key (hour, 15 minutes, or whatever time
 interval makes sense to give 1 to 10 megabytes per partition) and time and
 device as the clustering keys.

 Or, consider DSE SEarch  and then you can do whatever ad hoc queries
 you want using Solr. Or Stratio or TupleJump Stargate for an open source
 Lucene plugin.

 -- Jack Krupansky

 On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
 guilla...@databerries.com> wrote:

> Hello,
>
> We are currently storing geolocation events (about 1 per 5 minutes)
> for each device we track. We currently have 2 TB of data. I would like to
> store the device_id, the timestamp of the event, latitude and longitude. I
> though about using the device_id as the partition key and timestamp as the
> clustering column. It is great as events are naturally grouped by device
> (very useful for our Spark jobs). However, if I would like to retrieve all
> events of all devices of the last week I understood that Cassandra will
> need to load all data and filter which does not seems to be clean on the
> long term.
>
> How should I create my model?
>
> Best Regards
>


>>>
>>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Jack Krupansky
Cassandra is good at two kinds of queries: 1) access a specific row by a
specific key, and 2) access a slice or consecutive sequence of rows within
a given partition.

It is recommended to avoid ALLOW FILTERING. If it happens to work well for
you, great, go for it, but if it doesn't then simply don't do it. Best to
redesign your data model to play to Cassandra's strengths.

If you bucket the time-based table, do a separate query for each time
bucket.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> Kai, Jack,
>
> On 1., should the bucket be a STRING with a date format or do I have a
> better option ? For (device_id, bucket, timestamp), did you mean
> ((device_id, bucket), timestamp) ?
>
> On 2., what are the risks of timeout ? I currently have this warning:
> "Cannot execute this query as it might involve data filtering and thus may
> have unpredictable performance. If you want to execute this query despite
> the performance unpredictability, use ALLOW FILTERING".
>
> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:
>
>> 1. Don't make your partition unbound. It's tempting to just use
>> (device_id, timestamp). But soon or later you will have problem when time
>> goes by. You can keep the partition bound by using (device_id, bucket,
>> timestamp). Use hour, day, month or even year like Jack mentioned depending
>> on the size of data.
>>
>> 2. As to your specific query, for a given partition and a time range, C*
>> doesn't need to load the whole partition then filter. It only retrieves the
>> slice within the time range from disk because the data is clustered by
>> timestamp.
>>
>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky 
>> wrote:
>>
>>> The general rule in Cassandra data modeling is to look at all of your
>>> queries first and then to declare a table for each query, even if that
>>> means storing multiple copies of the data. So, create a second table with
>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>>> device as the clustering keys.
>>>
>>> Or, consider DSE SEarch  and then you can do whatever ad hoc queries you
>>> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
>>> plugin.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>> guilla...@databerries.com> wrote:
>>>
 Hello,

 We are currently storing geolocation events (about 1 per 5 minutes) for
 each device we track. We currently have 2 TB of data. I would like to store
 the device_id, the timestamp of the event, latitude and longitude. I though
 about using the device_id as the partition key and timestamp as the
 clustering column. It is great as events are naturally grouped by device
 (very useful for our Spark jobs). However, if I would like to retrieve all
 events of all devices of the last week I understood that Cassandra will
 need to load all data and filter which does not seems to be clean on the
 long term.

 How should I create my model?

 Best Regards

>>>
>>>
>>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Guillaume Charhon
Kai, Jack,

On 1., should the bucket be a STRING with a date format, or do I have a
better option? For (device_id, bucket, timestamp), did you mean
((device_id, bucket), timestamp)?

On 2., what are the risks of timeout? I currently have this warning:
"Cannot execute this query as it might involve data filtering and thus may
have unpredictable performance. If you want to execute this query despite
the performance unpredictability, use ALLOW FILTERING".

On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang  wrote:

> 1. Don't make your partition unbound. It's tempting to just use
> (device_id, timestamp). But soon or later you will have problem when time
> goes by. You can keep the partition bound by using (device_id, bucket,
> timestamp). Use hour, day, month or even year like Jack mentioned depending
> on the size of data.
>
> 2. As to your specific query, for a given partition and a time range, C*
> doesn't need to load the whole partition then filter. It only retrieves the
> slice within the time range from disk because the data is clustered by
> timestamp.
>
> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky 
> wrote:
>
>> The general rule in Cassandra data modeling is to look at all of your
>> queries first and then to declare a table for each query, even if that
>> means storing multiple copies of the data. So, create a second table with
>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>> device as the clustering keys.
>>
>> Or, consider DSE SEarch  and then you can do whatever ad hoc queries you
>> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
>> plugin.
>>
>> -- Jack Krupansky
>>
>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>> guilla...@databerries.com> wrote:
>>
>>> Hello,
>>>
>>> We are currently storing geolocation events (about 1 per 5 minutes) for
>>> each device we track. We currently have 2 TB of data. I would like to store
>>> the device_id, the timestamp of the event, latitude and longitude. I though
>>> about using the device_id as the partition key and timestamp as the
>>> clustering column. It is great as events are naturally grouped by device
>>> (very useful for our Spark jobs). However, if I would like to retrieve all
>>> events of all devices of the last week I understood that Cassandra will
>>> need to load all data and filter which does not seems to be clean on the
>>> long term.
>>>
>>> How should I create my model?
>>>
>>> Best Regards
>>>
>>
>>
>


Re: Best way to recreate a cassandra node with data

2015-11-09 Thread Eric Stevens
Check nodetool status to see if the replacement node is fully joined (UN
status).  If it is and it didn't stream any data, then either
auto_bootstrap was false, or the node was in its own seeds list.  If you
lost a node, then replace_address as Jonny mentioned would probably be a
good idea.

On Mon, Nov 9, 2015 at 1:31 AM Johnny Miller 
wrote:

> John - Why not just follow the process for replacing a dead node? Why
> do you need to use the same IP? e.g. JVM_OPTS="$JVM_OPTS
> -Dcassandra.replace_address=address_of_dead_node"
>
>
> http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html
>
>
> Johnny
>
>
> On 9 Nov 2015, at 05:11, John Wong  wrote:
>
> Hi.
> We are running Cassandra 1.2.19, and we are AWS customer, so we store our
> data in ephemeral storage.
>
> If we recreate an instance with the same IP, what is the best way to get
> the node up and running with the previous data? Right now I am relying on
> backup.
>
> I was hoping that we can stream the data, but nodetool rebuild is for
> bringing up a new data center. I just re-created the instance and I don't
> see much going on except keyspaces are being re-created with some data
> file. I thought Cassandra would automatically stream data from replicas...
>
> Ideas?
> Thanks.
>
> John
>
>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Kai Wang
1. Don't make your partition unbounded. It's tempting to just use (device_id,
timestamp), but sooner or later you will have problems as time goes by. You
can keep the partition bounded by using (device_id, bucket, timestamp). Use
hour, day, month or even year, as Jack mentioned, depending on the size of
the data.

2. As to your specific query, for a given partition and a time range, C*
doesn't need to load the whole partition then filter. It only retrieves the
slice within the time range from disk because the data is clustered by
timestamp.
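
As a rough illustration of point 2 (assuming a bucketed table along the lines
of events_by_device with PRIMARY KEY ((device_id, bucket), event_time); the
names, the example uuid, and the month-sized bucket are only assumptions for
this sketch), a time-range read touches just the requested slice of one
partition:

  SELECT event_time, latitude, longitude
  FROM events_by_device
  WHERE device_id  = 123e4567-e89b-12d3-a456-426655440000
    AND bucket     = '2015-11'
    AND event_time >= '2015-11-09 00:00:00+0000'
    AND event_time <  '2015-11-10 00:00:00+0000';

Because rows are clustered by event_time inside the (device_id, bucket)
partition, Cassandra can read only that slice from disk rather than the whole
partition.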

On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky 
wrote:

> The general rule in Cassandra data modeling is to look at all of your
> queries first and then to declare a table for each query, even if that
> means storing multiple copies of the data. So, create a second table with
> bucketed time as the partition key (hour, 15 minutes, or whatever time
> interval makes sense to give 1 to 10 megabytes per partition) and time and
> device as the clustering keys.
>
> Or, consider DSE SEarch  and then you can do whatever ad hoc queries you
> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
> plugin.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
> guilla...@databerries.com> wrote:
>
>> Hello,
>>
>> We are currently storing geolocation events (about 1 per 5 minutes) for
>> each device we track. We currently have 2 TB of data. I would like to store
>> the device_id, the timestamp of the event, latitude and longitude. I though
>> about using the device_id as the partition key and timestamp as the
>> clustering column. It is great as events are naturally grouped by device
>> (very useful for our Spark jobs). However, if I would like to retrieve all
>> events of all devices of the last week I understood that Cassandra will
>> need to load all data and filter which does not seems to be clean on the
>> long term.
>>
>> How should I create my model?
>>
>> Best Regards
>>
>
>


Re: Can't save Opscenter Dashboard

2015-11-09 Thread Kai Wang
Finally I got this one resolved. I sent feedback via Help->Feedback on the
OpsCenter page. Someone is actually reading those - imagine that. Big +1 to
DataStax. Here is the fix:

First, visit this URL:

  http://your_ip:your_port/Test_Cluster/rc/dashboard_presets/

You should get a response like this:

  {"838ef1a3-9d49-41ff-84e3-4d96440487e5": {}}

Then delete that preset with another request:

  curl -X "DELETE" http://your_ip:your_port/Test_Cluster/rc/dashboard_presets/838ef1a3-9d49-41ff-84e3-4d96440487e5

This will clear out the broken dashboard settings and allow you to
reconfigure the dashboard again.

On Thu, Nov 5, 2015 at 10:02 AM, Kai Wang  wrote:

> It happens again after I reboot another node. This time I see errors in
> agent.log. It seems to be related to the previous dead node.
>
>   INFO [clojure-agent-send-off-pool-2] 2015-11-05 09:48:41,602 Attempting
> to load stored metric values.
>  ERROR [clojure-agent-send-off-pool-2] 2015-11-05 09:48:41,613 There was
> an error when attempting to load stored rollups.
>  com.datastax.driver.core.exceptions.DriverInternalError: Unexpected error
> while processing response from /x.x.x.x:9042
> at
> com.datastax.driver.core.exceptions.DriverInternalError.copy(DriverInternalError.java:42)
> at
> com.datastax.driver.core.exceptions.DriverInternalError.copy(DriverInternalError.java:24)
> ...
> Caused by: com.datastax.driver.core.exceptions.DriverInternalError:
> Unexpected error while processing response from /x.x.x.x:9042
> at
> com.datastax.driver.core.DefaultResultSetFuture.onSet(DefaultResultSetFuture.java:150)
> at
> com.datastax.driver.core.RequestHandler.setFinalResult(RequestHandler.java:183)
> at
> com.datastax.driver.core.RequestHandler.access$2300(RequestHandler.java:45)
> ...
> Caused by: java.lang.IllegalStateException: Can't use this cluster
> instance because it was previously closed
> at com.datastax.driver.core.Cluster.checkNotClosed(Cluster.java:493)
> at com.datastax.driver.core.Cluster.access$400(Cluster.java:61)
> at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1231)
> ...
> INFO [clojure-agent-send-off-pool-1] 2015-11-05 09:48:41,618 Attempting to
> load stored metric values.
>  ERROR [clojure-agent-send-off-pool-1] 2015-11-05 09:48:41,622 There was
> an error when attempting to load stored rollups.
>  com.datastax.driver.core.exceptions.InvalidQueryException: Invalid null
> value for partition key part key
> at
> com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
> at
> com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:291)
>
>
> On Wed, Nov 4, 2015 at 8:43 PM, qihuang.zheng <
> qihuang.zh...@fraudmetrix.cn> wrote:
>
>> We had this problem with version 5.2.0, so we decided to upgrade to 5.2.2.
>>
>> But the problem seems to remain. We solved it by completely deleting the
>> related agent files and processes and restarting, just like a first-time install.
>>
>>
>> sudo kill -9 `ps -ef|grep datastax_agent_monitor | head -1 |awk '{print
>> $2}'` && \
>>
>> sudo kill -9 `cat /var/run/datastax-agent/datastax-agent.pid` && \
>>
>> sudo rm -rf /var/lib/datastax-agent && \
>>
>> sudo rm -rf /usr/share/datastax-agent
>>
>> --
>> qihuang.zheng
>>
>>  Original message
>> *From:* Kai Wang
>> *To:* user
>> *Sent:* Thursday, November 5, 2015, 04:39
>> *Subject:* Can't save Opscenter Dashboard
>>
>> Hi,
>>
>> Today after one of the nodes is rebooted, OpsCenter dashboard doesn't
>> save anymore. It starts with an empty dashboard with no widget or graph. If
>> I add some graph/widget, they are being updated fine. But if I refresh the
>> browser, the dashboard became empty again.
>>
>> Also there's no "DEFAULT" tab on the dashboard as the user guide shows. I
>> am not sure if it was there before.
>>
>
>


Re: How to organize a timeseries by device?

2015-11-09 Thread Jack Krupansky
The general rule in Cassandra data modeling is to look at all of your
queries first and then to declare a table for each query, even if that
means storing multiple copies of the data. So, create a second table with
bucketed time as the partition key (hour, 15 minutes, or whatever time
interval makes sense to give 1 to 10 megabytes per partition) and time and
device as the clustering keys.
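
A minimal sketch of such a second, time-partitioned table (the names and the
hour-sized bucket are assumptions made for this example, not taken from the
thread):

  CREATE TABLE events_by_time (
      time_bucket text,       -- e.g. '2015-11-09 14:00': one hour of events per partition
      event_time  timestamp,
      device_id   uuid,
      latitude    double,
      longitude   double,
      PRIMARY KEY ((time_bucket), event_time, device_id)
  );

A query for a longer period, such as the last week, would then be issued as
one SELECT per time bucket.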

Or, consider DSE Search, and then you can do whatever ad hoc queries you
want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
plugin.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon  wrote:

> Hello,
>
> We are currently storing geolocation events (about 1 per 5 minutes) for
> each device we track. We currently have 2 TB of data. I would like to store
> the device_id, the timestamp of the event, latitude and longitude. I though
> about using the device_id as the partition key and timestamp as the
> clustering column. It is great as events are naturally grouped by device
> (very useful for our Spark jobs). However, if I would like to retrieve all
> events of all devices of the last week I understood that Cassandra will
> need to load all data and filter which does not seems to be clean on the
> long term.
>
> How should I create my model?
>
> Best Regards
>


How to organize a timeseries by device?

2015-11-09 Thread Guillaume Charhon
Hello,

We are currently storing geolocation events (about 1 per 5 minutes) for
each device we track. We currently have 2 TB of data. I would like to store
the device_id, the timestamp of the event, latitude and longitude. I thought
about using the device_id as the partition key and timestamp as the
clustering column. It is great, as events are naturally grouped by device
(very useful for our Spark jobs). However, if I want to retrieve all events
of all devices from the last week, I understand that Cassandra will need to
load all the data and filter it, which does not seem to be clean in the
long term.

How should I create my model?

Best Regards


which astyanax version to use?

2015-11-09 Thread Lu, Boying
Hi, All,

We plan to upgrade Cassandra from 2.0.17 to 2.1.11 (the latest stable release
recommended for production environments) in our product.
We are currently using Astyanax 1.56.49 as the Java client. I found that there are
many newer Astyanax releases at https://github.com/Netflix/astyanax/releases
So which version should we use in a production environment, 3.8.0?

Thanks

Boying



Re: Does nodetool cleanup clears tombstones in the CF?

2015-11-09 Thread Johnny Miller
You could also have a look at the JMX forceUserDefinedCompaction call on a 
specific SSTable

> On 5 Nov 2015, at 21:56, K F  wrote:
> 
> Thanks Rob, I will look into checksstablegarbage utility. However, I don't 
> want to run major compaction as that would result in too big of a sstable.
> 
> Regards,
> K F
> 
> From: Robert Coli 
> To: "user@cassandra.apache.org" ; K F 
>  
> Sent: Thursday, November 5, 2015 1:53 PM
> Subject: Re: Does nodetool cleanup clears tombstones in the CF?
> 
> 
> 
> On Wed, Nov 4, 2015 at 12:56 PM, K F  > wrote:
> Quick question, in order for me to purge tombstones on particular nodes if I 
> run nodetool cleanup   will that help in 
> purging the tombstones from that node?
> 
> cleanup is for removing data from ranges the node no longer owns.
> 
> It is unrelated to tombstones.
> 
> There are various approaches to cleaning up tombstones. A simple (if manual) 
> one is to use "checksstablegarbage" and user defined compaction. Even simpler 
> is to run a major compaction, but this has some downsides.
> 
> =Rob
>  
> 
> 



Re: Best way to recreate a cassandra node with data

2015-11-09 Thread Johnny Miller
John - Why not just follow the process for replacing a dead node? Why do 
you need to use the same IP? e.g. JVM_OPTS="$JVM_OPTS 
-Dcassandra.replace_address=address_of_dead_node"

http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_replace_node_t.html
 



Johnny


> On 9 Nov 2015, at 05:11, John Wong  wrote:
> 
> Hi.
> We are running Cassandra 1.2.19, and we are AWS customer, so we store our 
> data in ephemeral storage.
> 
> If we recreate an instance with the same IP, what is the best way to get the 
> node up and running with the previous data? Right now I am relying on backup.
> 
> I was hoping that we can stream the data, but nodetool rebuild is for 
> bringing up a new data center. I just re-created the instance and I don't see 
> much going on except keyspaces are being re-created with some data file. I 
> thought Cassandra would automatically stream data from replicas...
> 
> Ideas?
> Thanks.
> 
> John