Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
Hi Ayan and Helena,

I've considered using Cassandra/HBase but ended up opting to save to HDFS on
the workers because I want to take advantage of data locality, since the data
will often be loaded into Spark for further processing. I was also under the
impression that saving to the filesystem (instead of a database) is the better
option for intermediate data. I'm definitely going to read up some more and
reconsider, though, given the time series nature of the data.

This might be a bit off topic, but in your experience is it common to store
intermediate data in Cassandra when it will be loaded into Spark many times
in the future?

Regarding how late data can arrive, I might be able to set a limit.
Would you know if it's possible to combine RDDs from different batch intervals
in Spark Streaming? Or would I need to write to files first and then group the
data by time dimension in a separate batch job?

Thanks in advance!
Nisrina.
 On May 16, 2015 7:26 PM, Helena Edelson helena.edel...@datastax.com
wrote:

 Consider using cassandra with spark streaming and timeseries, cassandra
 has been doing time series for years.
 Here’s some snippets with kafka streaming and writing/reading the data
 back:


 https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64

 or write in the stream, read back

 https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61

 or more detailed reads back

 https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69



 A CassandraInputDStream is coming, i’m working on it now.

 Helena
 @helenaedelson

 On May 15, 2015, at 9:59 AM, ayan guha guha.a...@gmail.com wrote:

 Hi

 Do you have a cut off time, like how late an event can be? Else, you may
 consider a different persistent storage like Cassandra/Hbase and delegate
 update: part to them.

 On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati 
 nisrina.luthfiy...@gmail.com wrote:


 Hi all,
 I have a stream of data from Kafka that I want to process and store in
 hdfs using Spark Streaming.
 Each data has a date/time dimension and I want to write data within the
 same time dimension to the same hdfs directory. The data stream might be
 unordered (by time dimension).

 I'm wondering what are the best practices in grouping/storing time series
 data stream using Spark Streaming?

 I'm considering grouping each batch of data in Spark Streaming per time
 dimension and then saving each group to different hdfs directories. However
 since it is possible for data with the same time dimension to be in
 different batches, I would need to handle update in case the hdfs
 directory already exists.

 Is this a common approach? Are there any other approaches that I can try?

 Thank you!
 Nisrina.




 --
 Best Regards,
 Ayan Guha





Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
Consider using Cassandra with Spark Streaming for time series data. Cassandra
has been used for time series for years.
Here are some snippets with Kafka streaming that write the data and read it back:

https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64

or write in the stream, read back
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61

or more detailed reads back
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69
 


A CassandraInputDStream is coming; I’m working on it now.

Helena
@helenaedelson

 On May 15, 2015, at 9:59 AM, ayan guha guha.a...@gmail.com wrote:
 
 Hi
 
 Do you have a cut off time, like how late an event can be? Else, you may 
 consider a different persistent storage like Cassandra/Hbase and delegate 
 update: part to them. 
 
 On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati 
 nisrina.luthfiy...@gmail.com wrote:
 
 Hi all,
 I have a stream of data from Kafka that I want to process and store in hdfs 
 using Spark Streaming.
 Each data has a date/time dimension and I want to write data within the same 
 time dimension to the same hdfs directory. The data stream might be unordered 
 (by time dimension).
 
 I'm wondering what are the best practices in grouping/storing time series 
 data stream using Spark Streaming?
 
 I'm considering grouping each batch of data in Spark Streaming per time 
 dimension and then saving each group to different hdfs directories. However 
 since it is possible for data with the same time dimension to be in different 
 batches, I would need to handle update in case the hdfs directory already 
 exists.
 
 Is this a common approach? Are there any other approaches that I can try?
 
 Thank you!
 Nisrina.
 
 
 
 -- 
 Best Regards,
 Ayan Guha



Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread Nisrina Luthfiyati
Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS
using Spark Streaming.
Each record has a date/time dimension, and I want to write records with the
same time dimension to the same HDFS directory. The data stream might be
unordered (by time dimension).

I'm wondering what the best practices are for grouping/storing a time series
data stream using Spark Streaming.

I'm considering grouping each batch of data in Spark Streaming by time
dimension and then saving each group to a different HDFS directory. However,
since data with the same time dimension can arrive in different batches, I
would need to handle updates in case the HDFS directory already exists.
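
A minimal sketch of the per-batch grouping described above, written in plain
Python rather than Spark so that it runs on its own. The hourly bucket
granularity, the `/data/events` base path, and the helper names `bucket_dir`,
`group_batch`, and `output_path` are assumptions for illustration only; none of
them come from Spark or HDFS. Writing one new file per batch into the bucket
directory is one possible way to avoid updating a directory that already
exists.

```python
from collections import defaultdict
from datetime import datetime


def bucket_dir(base, event_time):
    """Map an event timestamp to an hourly directory under `base` (assumed layout)."""
    return f"{base}/{event_time:%Y/%m/%d/%H}"


def group_batch(records, base="/data/events"):
    """Group one batch of (event_time, payload) pairs by target directory."""
    groups = defaultdict(list)
    for event_time, payload in records:
        groups[bucket_dir(base, event_time)].append(payload)
    return dict(groups)


def output_path(directory, batch_id):
    """Build a per-batch file name so earlier files in the directory stay untouched."""
    return f"{directory}/part-{batch_id:05d}.txt"


# Example batch with records arriving out of order by event time.
batch = [
    (datetime(2015, 5, 15, 20, 10), "a"),
    (datetime(2015, 5, 15, 19, 59), "b"),
    (datetime(2015, 5, 15, 20, 30), "c"),
]
for directory, payloads in group_batch(batch).items():
    print(output_path(directory, batch_id=7), payloads)
```

With this layout, a later batch that contains more records for the same hour
adds another file to that hour's directory instead of rewriting it. In Spark
the grouping would typically happen per batch RDD, but that wiring is left out
here.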

Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.


Re: Grouping and storing unordered time series data stream to HDFS

2015-05-15 Thread ayan guha
Hi

Do you have a cut-off time, i.e. a limit on how late an event can be?
Otherwise, you may consider a different persistent store such as
Cassandra/HBase and delegate the update part to it.
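
As a rough illustration of what a cut-off could look like, here is a small
plain-Python sketch (not Spark code). The function name `split_by_cutoff`, the
`max_lateness` parameter, and the two-hour limit in the example are all
assumptions made up for this sketch.

```python
from datetime import datetime, timedelta


def split_by_cutoff(records, now, max_lateness):
    """Split (event_time, payload) pairs into on-time and too-late lists.

    A record counts as on time when its event time is not older than
    `now - max_lateness`.
    """
    cutoff = now - max_lateness
    on_time, late = [], []
    for event_time, payload in records:
        target = on_time if event_time >= cutoff else late
        target.append((event_time, payload))
    return on_time, late


# Example: with a two-hour limit at 21:00, anything before 19:00 is too late.
now = datetime(2015, 5, 15, 21, 0)
records = [
    (datetime(2015, 5, 15, 20, 30), "a"),
    (datetime(2015, 5, 15, 18, 0), "b"),
]
on_time, late = split_by_cutoff(records, now, timedelta(hours=2))
```

Records in `late` could then be dropped or routed to a separate store, while
the on-time records are written to their time-based directories.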

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati 
nisrina.luthfiy...@gmail.com wrote:


 Hi all,
 I have a stream of data from Kafka that I want to process and store in
 hdfs using Spark Streaming.
 Each data has a date/time dimension and I want to write data within the
 same time dimension to the same hdfs directory. The data stream might be
 unordered (by time dimension).

 I'm wondering what are the best practices in grouping/storing time series
 data stream using Spark Streaming?

 I'm considering grouping each batch of data in Spark Streaming per time
 dimension and then saving each group to different hdfs directories. However
 since it is possible for data with the same time dimension to be in
 different batches, I would need to handle update in case the hdfs
 directory already exists.

 Is this a common approach? Are there any other approaches that I can try?

 Thank you!
 Nisrina.




-- 
Best Regards,
Ayan Guha