You should think about a custom receiver to solve the "already collected" data problem:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
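A minimal sketch of such a receiver, assuming the collected logs sit in a plain text file; the class name, the file path, and the line-by-line replay are illustrative, not something prescribed by the linked guide:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

import scala.io.Source

// Sketch: a custom receiver that replays lines from an
// already-collected log file into Spark Streaming.
class FileReplayReceiver(path: String)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Read and push data on a separate thread so onStart returns quickly.
    new Thread("file-replay-receiver") {
      override def run(): Unit = replay()
    }.start()
  }

  override def onStop(): Unit = {
    // Nothing to clean up: the replay loop checks isStopped() and exits.
  }

  private def replay(): Unit = {
    val source = Source.fromFile(path)
    try {
      for (line <- source.getLines() if !isStopped()) {
        store(line) // hand each log line to Spark Streaming
      }
    } finally {
      source.close()
    }
  }
}

You would wire it in with something like ssc.receiverStream(new FileReplayReceiver("/path/to/log")) (path hypothetical), which gives a DStream[String] that the usual window operations apply to.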
Gianluca

On 04 Jul 2014, at 15:46, alessandro finamore <alessandro.finam...@polito.it> wrote:

Hi,

I have a large dataset of text log files on which I need to implement "window analysis": extract per-minute data, say, and compute aggregated stats on the last X minutes. I have to implement the windowing analysis with Spark.

This is the workflow I'm currently using:
- read a file and create a new RDD with per-minute info
- loop on each new minute and integrate its data into another data structure containing the last X minutes of data
- apply the analysis to the updated window of data

This works, but it suffers from limited parallelism. Do you have any recommendations/suggestions for a better implementation? Also, are there any recommended data collections for managing the window? (I'm simply using Arrays.)

While working on this I started to investigate Spark Streaming. The problem is that I don't know whether it is really possible to use it on already-collected data. This post seems to indicate that it should be, but it is not clear to me how:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html

Thanks
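P.S. On the "already collected" question above, an alternative to a custom receiver is StreamingContext.queueStream, which replays a queue of pre-built RDDs, one per batch, so the built-in window operations do the last-X-minutes bookkeeping instead of a hand-managed Array. A minimal sketch; the synthetic (key, count) pairs and the one-RDD-per-minute split are assumptions, not from the original post:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object ReplayWindows {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-analysis").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one batch per replayed minute
    val sc = ssc.sparkContext

    // Queue one RDD per minute of already-collected data. The pairs here
    // are synthetic placeholders; in practice, parse the log file and
    // split it on its per-minute timestamp field.
    val perMinuteRdds = mutable.Queue[RDD[(String, Long)]]()
    for (minute <- 0 until 10) {
      perMinuteRdds += sc.parallelize(Seq(("host-a", 1L), ("host-b", 2L)))
    }

    // oneAtATime = true feeds exactly one queued RDD per batch interval.
    val stream = ssc.queueStream(perMinuteRdds, oneAtATime = true)

    // Aggregate over the last X "minutes" (batches), sliding one at a time;
    // Spark Streaming maintains the window state itself.
    val X = 5
    val windowed = stream.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Seconds(X), Seconds(1))
    windowed.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

With oneAtATime = true each batch interval stands in for one minute of log time, so reduceByKeyAndWindow computes the per-key totals over the last X minutes in parallel across the cluster.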