Hi

Do you have a cut-off time, i.e. a bound on how "late" an event can be?
If not, you may consider a different persistent store such as
Cassandra/HBase and delegate the "update" part to it.

On Fri, May 15, 2015 at 8:10 PM, Nisrina Luthfiyati <
nisrina.luthfiy...@gmail.com> wrote:

>
> Hi all,
> I have a stream of data from Kafka that I want to process and store in
> HDFS using Spark Streaming.
> Each record has a date/time dimension, and I want to write records within
> the same time dimension to the same HDFS directory. The data stream might
> be unordered (by time dimension).
>
> I'm wondering what the best practices are for grouping/storing a time
> series data stream using Spark Streaming?
>
> I'm considering grouping each batch of data in Spark Streaming per time
> dimension and then saving each group to a different HDFS directory.
> However, since it is possible for data with the same time dimension to
> arrive in different batches, I would need to handle an "update" in case
> the HDFS directory already exists.
>
> Is this a common approach? Are there any other approaches that I can try?
>
> Thank you!
> Nisrina.
>
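
For the directory-per-date approach you describe, one way to avoid
rewriting an existing HDFS directory is to give every batch its own
sub-directory under the date, so "update" becomes a plain append of new
files. A rough sketch (dateOf is a hypothetical extractor for your date
dimension):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical: pulls the date dimension, e.g. "2015-05-15", out of a record.
def dateOf(record: String): String = record.take(10)

def saveByDate(stream: DStream[String], baseDir: String): Unit = {
  stream.foreachRDD { (rdd: RDD[String], batchTime: Time) =>
    // Find which dates occur in this batch (usually only a handful).
    val dates = rdd.map(dateOf).distinct().collect()
    for (date <- dates) {
      // Every batch writes to its own sub-directory, so an existing date
      // directory is never rewritten: late data for an old date just lands
      // in a new batch-<timestamp> sub-directory under it.
      rdd.filter(r => dateOf(r) == date)
         .saveAsTextFile(s"$baseDir/$date/batch-${batchTime.milliseconds}")
    }
  }
}

Downstream readers can then treat each date directory as a partition and
pick up new batch sub-directories as they appear.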



-- 
Best Regards,
Ayan Guha
