Hi,

Do you have a cut-off time, i.e. how "late" an event can be? Otherwise, you may consider a different persistent store such as Cassandra/HBase and delegate the "update" part to it.
On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <nisrina.luthfiy...@gmail.com> wrote:

> Hi all,
> I have a stream of data from Kafka that I want to process and store in
> hdfs using Spark Streaming.
> Each data has a date/time dimension and I want to write data within the
> same time dimension to the same hdfs directory. The data stream might be
> unordered (by time dimension).
>
> I'm wondering what are the best practices in grouping/storing time series
> data stream using Spark Streaming?
>
> I'm considering grouping each batch of data in Spark Streaming per time
> dimension and then saving each group to different hdfs directories. However
> since it is possible for data with the same time dimension to be in
> different batches, I would need to handle "update" in case the hdfs
> directory already exists.
>
> Is this a common approach? Are there any other approaches that I can try?
>
> Thank you!
> Nisrina.

--
Best Regards,
Ayan Guha