Re: Spark Streaming with long batch / window duration
Thanks. If I don't use a window and instead choose to stream the data onto HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-tp10191p29005.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
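A cron-driven cleanup is a reasonable approach. One minimal sketch of the retention logic, assuming the streaming job writes into date-partitioned directories like /data/events/2014-07-18 (the path layout, function name, and retention constant here are all hypothetical, not anything Spark provides):

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 7  # keep one week of data

def expired_paths(paths, today, retention_days=RETENTION_DAYS):
    """Return the date-partitioned HDFS paths older than the retention window.

    Assumes each path ends in a YYYY-MM-DD partition, e.g.
    /data/events/2014-07-10 (a hypothetical layout).
    """
    cutoff = today - timedelta(days=retention_days)
    out = []
    for p in paths:
        date_str = p.rstrip("/").rsplit("/", 1)[-1]
        try:
            d = datetime.strptime(date_str, "%Y-%m-%d").date()
        except ValueError:
            continue  # skip paths without a date partition (e.g. _SUCCESS files)
        if d < cutoff:
            out.append(p)
    return out

if __name__ == "__main__":
    # In the cron job, the path list would come from `hdfs dfs -ls /data/events`,
    # and each expired path would be removed with `hdfs dfs -rm -r <path>`.
    paths = ["/data/events/2014-07-10", "/data/events/2014-07-17", "/data/events/_SUCCESS"]
    for p in expired_paths(paths, datetime(2014, 7, 18).date()):
        print(p)
```

Driving this from cron (daily, say) keeps the retention policy entirely outside the streaming job, so the job itself stays a simple append-only writer.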
Re: Spark Streaming with long batch / window duration
So I think I may end up using Hourglass (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop), a Hadoop framework for incremental data processing. It would be very cool if Spark (not Streaming) could support something like this.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-tp10191p10311.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Spark Streaming with long batch / window duration
If you want to process data that spans weeks, then it is best to use a dedicated data store (file system, SQL/NoSQL database, etc.) that is designed for long-term data storage and retrieval. Spark Streaming is not designed to be a long-term data store. Also, it does not seem like you need low latency, so it might be better to use a combination of Spark Streaming and Spark programs: Spark Streaming to receive data and store it in some long-term data store, and Spark to periodically (every hour? every day?) pull the data from the store and process it. You can implement the invertible function yourself in Spark by storing the previous reduced values in the same data store every time the Spark program runs, and then using that data the next time. The great thing is that both these programs can share all the map and reduce functions.

TD

On Fri, Jul 18, 2014 at 12:09 PM, aaronjosephs <aa...@placeiq.com> wrote:
> Would it be a reasonable use case of Spark Streaming to have a very large
> window size (let's say on the scale of weeks)? In this particular case the
> reduce function would be invertible, so that would aid efficiency. I assume
> that using a larger batch size, since the window is so large, would also
> lighten the workload for Spark. The sliding duration is not too important;
> I just want to know if this is reasonable for Spark to handle with any
> slide duration.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-tp10191.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
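The incremental pattern TD describes, combining the previous reduced values with each new batch and subtracting the batch that falls out of the window, can be sketched outside Spark with plain dictionaries standing in for keyed data (all names here are made up for illustration; in a real job the dicts would be read from and written back to the long-term store, and reduce_op/inverse_op would be the same functions one would pass to reduceByKeyAndWindow):

```python
def reduce_op(a, b):
    """The associative reduce function, e.g. summing counts per key."""
    return a + b

def inverse_op(a, b):
    """Its inverse, used to subtract data that has left the window."""
    return a - b

def update_window(prev_totals, new_batch, expired_batch):
    """One incremental step: totals' = totals + new_batch - expired_batch.

    prev_totals, new_batch, and expired_batch are {key: value} dicts
    standing in for the keyed data a periodic Spark job would load
    from and persist to the long-term store on each run.
    """
    totals = dict(prev_totals)
    for k, v in new_batch.items():
        totals[k] = reduce_op(totals.get(k, 0), v)
    for k, v in expired_batch.items():
        totals[k] = inverse_op(totals[k], v)
        if totals[k] == 0:
            del totals[k]  # drop keys no longer present in the window
    return totals

if __name__ == "__main__":
    week_totals = {"a": 10, "b": 3}       # previous reduced values from the store
    new_day = {"a": 2, "c": 5}            # latest batch to fold in
    day_falling_out = {"a": 4, "b": 3}    # oldest batch, now past one week
    print(update_window(week_totals, new_day, day_falling_out))
    # {'a': 8, 'c': 5}
```

Because only one new batch is reduced in and one old batch reduced out per run, each run's cost is proportional to a single batch rather than to the full multi-week window, which is exactly the efficiency an invertible reduce function buys.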
Re: Spark Streaming with long batch / window duration
Unfortunately, for reasons I won't go into, my options for what I can use are limited. It was more of a curiosity to see whether Spark could handle a use case like this, since the functionality I wanted fit perfectly into the reduceByKeyAndWindow frame of thinking. Anyway, thanks for answering.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-long-batch-window-duration-tp10191p10219.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.