Re: Grouping and storing unordered time series data stream to HDFS
Hi Ayan and Helena,

I've considered using Cassandra/HBase but ended up opting to save to HDFS on the workers because I want to take advantage of data locality, since the data will often be loaded into Spark for further processing. I was also under the impression that saving to a filesystem (instead of a database) is the better option for intermediate data. Given the time-series nature of the data, though, I'm definitely going to read up some more and reconsider.

This might be a bit off topic, but in your experience, is it common to store intermediate data that will be loaded into Spark many times in Cassandra?

Regarding how late data can be, I might be able to set a limit. Would you know if it's possible to combine RDDs from different intervals in Spark Streaming? Or would I need to write to files first and then group the data by time dimension in a separate batch job?

Thanks in advance!
Nisrina.

On May 16, 2015 7:26 PM, Helena Edelson helena.edel...@datastax.com wrote:

Consider using Cassandra with Spark Streaming and time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64
or write in the stream, read back:
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61
or more detailed reads back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69
A CassandraInputDStream is coming; I'm working on it now.

Helena
@helenaedelson

On May 15, 2015, at 9:59 AM, ayan guha guha.a...@gmail.com wrote:

Hi, do you have a cut-off time, i.e. how late can an event be? Otherwise, you may consider a different persistent store like Cassandra/HBase and delegate the update part to it.

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati nisrina.luthfiy...@gmail.com wrote:

Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS using Spark Streaming. Each record has a date/time dimension, and I want to write data within the same time dimension to the same HDFS directory. The data stream might be unordered (by time dimension). I'm wondering what the best practices are for grouping/storing a time-series data stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming by time dimension and then saving each group to a different HDFS directory. However, since it is possible for data with the same time dimension to be in different batches, I would need to handle updates in case the HDFS directory already exists. Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.

--
Best Regards,
Ayan Guha
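A minimal, Spark-free sketch of the separate batch-job regrouping mentioned above: records written across several streaming intervals are merged and re-grouped by their date dimension afterwards. All names (`regroup_by_day`, the `(timestamp, payload)` record shape) are hypothetical, not from any of the code linked in this thread:

```python
from collections import defaultdict
from datetime import datetime

def regroup_by_day(batches):
    """Merge records from several streaming intervals and re-group them
    by their date dimension -- the batch-step alternative to combining
    RDDs inside the stream."""
    by_day = defaultdict(list)
    for batch in batches:                 # one list of records per interval
        for ts, payload in batch:         # ts is an ISO-8601 timestamp string
            day = datetime.fromisoformat(ts).date().isoformat()
            by_day[day].append((ts, payload))
    return dict(by_day)

# Two intervals whose records overlap on the same day:
b1 = [("2015-05-15T10:00:00", "a"), ("2015-05-16T01:00:00", "b")]
b2 = [("2015-05-15T23:59:00", "c")]       # arrives later, belongs to the 15th
groups = regroup_by_day([b1, b2])
```

In Spark itself the equivalent would be `SparkContext.union` over the interval RDDs (or a windowed DStream) followed by a `groupBy` on the date key.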
Re: Grouping and storing unordered time series data stream to HDFS
Consider using Cassandra with Spark Streaming and time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L62-L64
or write in the stream, read back:
https://github.com/killrweather/killrweather/blob/master/killrweather-examples/src/main/scala/com/datastax/killrweather/KafkaStreamingJson2.scala#L53-L61
or more detailed reads back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/TemperatureActor.scala#L65-L69
A CassandraInputDStream is coming; I'm working on it now.

Helena
@helenaedelson

On May 15, 2015, at 9:59 AM, ayan guha guha.a...@gmail.com wrote:

Hi, do you have a cut-off time, i.e. how late can an event be? Otherwise, you may consider a different persistent store like Cassandra/HBase and delegate the update part to it.

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati nisrina.luthfiy...@gmail.com wrote:

Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS using Spark Streaming. Each record has a date/time dimension, and I want to write data within the same time dimension to the same HDFS directory. The data stream might be unordered (by time dimension). I'm wondering what the best practices are for grouping/storing a time-series data stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming by time dimension and then saving each group to a different HDFS directory. However, since it is possible for data with the same time dimension to be in different batches, I would need to handle updates in case the HDFS directory already exists. Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.

--
Best Regards,
Ayan Guha
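The Cassandra time-series layout Helena points to typically partitions rows by an entity id plus a coarse time bucket, with the event timestamp as a clustering column, so one day's readings land in one partition and reads by day are a single-partition query. A rough sketch of computing such a key; the day-granularity bucket and the `sensor_id` naming are assumptions for illustration, not taken from the killrweather code:

```python
from datetime import datetime

def partition_key(sensor_id, ts):
    """Compute a (sensor_id, day_bucket) partition key and keep the
    event timestamp as the clustering column -- the common Cassandra
    time-series layout, which tolerates out-of-order arrival because a
    late event simply lands in its day's partition."""
    day_bucket = datetime.fromisoformat(ts).strftime("%Y%m%d")
    return (sensor_id, day_bucket), ts

key, clustering = partition_key("sensor-1", "2015-05-15T09:59:00")
```

Because writes are keyed, a late event is just another insert into the right partition; there is no "directory already exists" case to handle, which is the update-delegation Ayan describes.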
Grouping and storing unordered time series data stream to HDFS
Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS using Spark Streaming. Each record has a date/time dimension, and I want to write data within the same time dimension to the same HDFS directory. The data stream might be unordered (by time dimension). I'm wondering what the best practices are for grouping/storing a time-series data stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming by time dimension and then saving each group to a different HDFS directory. However, since it is possible for data with the same time dimension to be in different batches, I would need to handle updates in case the HDFS directory already exists. Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.
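A common way around the "directory already exists" update problem is to make each micro-batch append a new, uniquely named file under the day's directory instead of rewriting it. The sketch below shows the idea with the local filesystem standing in for HDFS; the `date=` directory naming and `part-<batch>` file naming are illustrative assumptions:

```python
import os
import tempfile
from collections import defaultdict
from datetime import datetime

def write_batch(batch_id, records, root):
    """Group one micro-batch by date and append one file per day
    directory. Because each batch writes its own part file, late data
    for an existing day adds a file rather than updating in place."""
    by_day = defaultdict(list)
    for ts, payload in records:           # ts is an ISO-8601 timestamp string
        by_day[datetime.fromisoformat(ts).date().isoformat()].append(payload)
    for day, payloads in by_day.items():
        day_dir = os.path.join(root, f"date={day}")
        os.makedirs(day_dir, exist_ok=True)
        with open(os.path.join(day_dir, f"part-{batch_id}.txt"), "w") as f:
            f.write("\n".join(payloads))

root = tempfile.mkdtemp()
write_batch(0, [("2015-05-15T10:00:00", "a"), ("2015-05-16T01:00:00", "b")], root)
write_batch(1, [("2015-05-15T23:59:00", "c")], root)   # late data for the 15th
```

Downstream batch jobs then read a whole `date=...` directory, so out-of-order arrival only affects when a day's files are complete, not how they are laid out.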
Re: Grouping and storing unordered time series data stream to HDFS
Hi, do you have a cut-off time, i.e. how late can an event be? Otherwise, you may consider a different persistent store like Cassandra/HBase and delegate the update part to it.

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati nisrina.luthfiy...@gmail.com wrote:

Hi all,
I have a stream of data from Kafka that I want to process and store in HDFS using Spark Streaming. Each record has a date/time dimension, and I want to write data within the same time dimension to the same HDFS directory. The data stream might be unordered (by time dimension). I'm wondering what the best practices are for grouping/storing a time-series data stream using Spark Streaming?

I'm considering grouping each batch of data in Spark Streaming by time dimension and then saving each group to a different HDFS directory. However, since it is possible for data with the same time dimension to be in different batches, I would need to handle updates in case the HDFS directory already exists. Is this a common approach? Are there any other approaches that I can try?

Thank you!
Nisrina.

--
Best Regards,
Ayan Guha
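The cut-off idea above can be sketched as a watermark: events older than an allowed lateness, measured against the newest timestamp seen, are diverted (dropped or sent to a correction path) instead of being merged into already-written groups. The one-hour threshold and the function name are illustrative assumptions:

```python
from datetime import datetime, timedelta

def split_by_lateness(records, max_lateness=timedelta(hours=1)):
    """Partition records into on-time vs too-late relative to the
    newest event timestamp in the batch (a simple watermark).
    Too-late events can be dropped or routed to a separate fix-up job,
    so the main output path never has to update existing groups."""
    times = [datetime.fromisoformat(ts) for ts, _ in records]
    watermark = max(times) - max_lateness
    on_time = [(ts, p) for ts, p in records
               if datetime.fromisoformat(ts) >= watermark]
    too_late = [(ts, p) for ts, p in records
                if datetime.fromisoformat(ts) < watermark]
    return on_time, too_late

recs = [("2015-05-16T12:00:00", "x"),
        ("2015-05-16T11:30:00", "y"),
        ("2015-05-15T08:00:00", "z")]     # a day late
on_time, too_late = split_by_lateness(recs)
```

Once a cut-off exists, a day's HDFS directory can be treated as final after `day end + max_lateness`, which is what makes the file-per-batch approach workable without in-place updates.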