Another alternative could be to use Spark Streaming's textFileStream together with its windowing capabilities.
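A rough sketch of what I mean (untested; the directory path, split key, and window/slide durations are placeholders you would adapt):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowedLogStats {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowedLogStats")
    // Batch interval of one minute: each batch holds one minute of new data.
    val ssc = new StreamingContext(conf, Minutes(1))

    // Any file atomically moved into this directory becomes part of the stream.
    val lines = ssc.textFileStream("hdfs:///path/to/incoming/logs")

    // Example per-minute stat: count occurrences of the first field,
    // aggregated over the last 10 minutes, sliding every minute.
    val counts = lines
      .map(line => (line.split(" ")(0), 1L))
      .reduceByKeyAndWindow(_ + _, Minutes(10), Minutes(1))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The window/slide pair replaces the manual "integrate the last X minutes" loop, and reduceByKeyAndWindow keeps the aggregation parallel across the cluster.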
On Friday, July 4, 2014 9:52 AM, Gianluca Privitera <gianluca.privite...@studio.unibo.it> wrote:

You should think about a custom receiver, in order to solve the problem of the “already collected” data.

http://spark.apache.org/docs/latest/streaming-custom-receivers.html

Gianluca

On 04 Jul 2014, at 15:46, alessandro finamore <alessandro.finam...@polito.it> wrote:

> Hi,
>
> I have a large dataset of text log files on which I need to implement
> "window analysis":
> say, extract per-minute data and do aggregated stats on the last X minutes.
>
> I have to implement the windowing analysis with Spark.
> This is the workflow I'm currently using:
> - read a file and create a new RDD with per-minute info
> - loop on each new minute and integrate its data with another data structure
>   containing the last X minutes of data
> - apply the analysis on the updated window of data
>
> This works, but it suffers from limited parallelism.
> Do you have any recommendations/suggestions for a better implementation?
> Also, are there any recommended data collections for managing the window
> (I'm simply using Arrays for managing the data)?
>
> While working on this I started to investigate Spark Streaming.
> The problem is that I don't know if it is really possible to use it on
> already collected data.
> This post seems to indicate that it should be, but it is not clear to me how:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html
>
> Thanks
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-Spark-and-Spark-streaming-tp8806.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
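For completeness, here is a minimal sketch of the custom-receiver approach Gianluca mentions, for replaying already-collected files into a DStream. The class name, directory path, and file ordering are my assumptions, not a tested implementation:

```scala
import java.io.File
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Replays existing log files from a directory into the stream, one line at a time.
class ReplayFileReceiver(dir: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Read on a separate thread so onStart() returns quickly.
    new Thread("replay-file-receiver") {
      override def run(): Unit = replay()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do: the reader thread checks isStopped() between files.
  }

  private def replay(): Unit = {
    // Assumes lexicographic file names reflect time order.
    val files = new File(dir).listFiles().sortBy(_.getName)
    for (f <- files if !isStopped()) {
      val src = Source.fromFile(f)
      try src.getLines().foreach(store) finally src.close()
    }
  }
}

// Usage: val lines = ssc.receiverStream(new ReplayFileReceiver("/data/collected-logs"))
```

Note that replaying historical files this way uses wall-clock batch intervals, not the timestamps inside the logs, so windows are driven by ingestion time.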