We have a sophisticated Spark Streaming application that we have been using
successfully in production for over a year to process a time series of
events.  Our application makes novel use of updateStateByKey() for state
management.
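
To give a flavor of it, here is a stripped-down sketch of the kind of
stateful update we run (the Event/KeyState types and the aggregation are
simplified stand-ins for illustration, not our real code):

case class Event(timestamp: Long, value: Double)
case class KeyState(lastSeen: Long, runningTotal: Double)

// updateStateByKey folds each batch's new events for a key into that
// key's persistent state.  Note the state already tracks event
// timestamps rather than the wall clock.
def updateState(events: Seq[Event], state: Option[KeyState]): Option[KeyState] = {
  if (events.isEmpty) state
  else {
    val prev = state.getOrElse(KeyState(0L, 0.0))
    Some(KeyState(
      lastSeen = math.max(prev.lastSeen, events.map(_.timestamp).max),
      runningTotal = prev.runningTotal + events.map(_.value).sum))
  }
}

// Given events: DStream[(String, Event)] keyed by entity id:
//   val states = events.updateStateByKey(updateState _)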

We now need to perform exactly the same processing on input data that is
not real-time but has been persisted to disk.  We would rather not
rewrite our Spark Streaming app unless we have to.

/Might it be possible to perform "large batch" processing on HDFS
time-series data using Spark Streaming?/

1. I understand that there is not currently an InputDStream that can do
what's needed; I would have to create one myself (a rough skeleton
follows this list).
2. Time is a problem.  Any time-based logic and state management would
have to be driven by the timestamps on our events rather than by the
wall clock.
3. The "batch duration" would become meaningless in this scenario.  Could I
just set it to something really small (say 1 second) and then let it "fall
behind", processing the data as quickly as it could?

It all seems possible.  But could Spark Streaming work this way?  If I
created a DStream that delivered (say) months of events, could Spark
Streaming effectively process this in a "batch" fashion?
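
The closest existing thing I have found is queueStream(), which replays
a queue of pre-built RDDs as successive batches.  A toy sketch along
these lines (paths are hypothetical; I doubt this scales to months of
data, since every RDD must be defined up front, and as I understand it a
queue-based stream cannot be checkpointed, which updateStateByKey needs):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ReplayWithQueueStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("replay").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))  // nominal batch duration

    // One RDD per historical "batch", built from persisted HDFS chunks.
    val queue = new mutable.Queue[RDD[String]]()
    queue ++= Seq("hdfs:///events/part-0001", "hdfs:///events/part-0002")
      .map(ssc.sparkContext.textFile(_))

    // oneAtATime = true feeds exactly one queued RDD per batch interval.
    val replayed = ssc.queueStream(queue, oneAtATime = true)
    replayed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}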

Any and all comments/ideas welcome!