You should think about a custom receiver to solve the problem of the
“already collected” data:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
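
For example, a minimal sketch of such a receiver that replays already-collected
log files as if they were a live stream (the class name, file handling, and
threading here are illustrative assumptions, not taken from the linked guide):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    import scala.io.Source

    // Hypothetical receiver: replays lines from already-collected log files
    class LogReplayReceiver(files: Seq[String])
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Replay on a separate thread so onStart() returns quickly
        new Thread("log-replay") {
          override def run(): Unit = replay()
        }.start()
      }

      def onStop(): Unit = {
        // Nothing to clean up: the replay loop checks isStopped()
      }

      private def replay(): Unit = {
        for (file <- files if !isStopped()) {
          val source = Source.fromFile(file)
          try {
            // store() hands each line to Spark Streaming for batching
            source.getLines().foreach(store)
          } finally {
            source.close()
          }
        }
      }
    }

You would then plug it in with something like
ssc.receiverStream(new LogReplayReceiver(Seq("/path/to/log"))) and use the
normal window operations on the resulting DStream.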

Gianluca

On 04 Jul 2014, at 15:46, alessandro finamore
<alessandro.finam...@polito.it> wrote:

Hi,

I have a large dataset of text log files on which I need to implement
"window analysis": say, extract per-minute data and compute aggregated
stats on the last X minutes.

I have to implement the windowing analysis with Spark.
This is the workflow I'm currently using (sketched in code below):
- read a file and create a new RDD with per-minute info
- loop over each new minute and integrate its data with another data structure
containing the last X minutes of data
- apply the analysis to the updated window of data
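
Roughly, in code (the log format, parseLine, the path, and X are made-up
placeholders for illustration):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    // Hypothetical parser: each line is "epochSeconds value";
    // the key is truncated to the minute boundary
    def parseLine(line: String): (Long, Double) = {
      val fields = line.split(" ")
      (fields(0).toLong / 60 * 60, fields(1).toDouble)
    }

    val sc = new SparkContext("local[*]", "window-analysis")
    val perMinute = sc.textFile("/path/to/logs")
      .map(parseLine)
      .reduceByKey(_ + _)   // per-minute aggregation, done in parallel
      .sortByKey()
      .collect()            // per-minute stats are small; loop on the driver

    // Sliding window of the last X minutes, driven sequentially on the driver
    val X = 10
    perMinute.sliding(X).foreach { window =>
      // apply the analysis on the current window of per-minute stats
      val total = window.map(_._2).sum
      println(s"window ending ${window.last._1}: total = $total")
    }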

This works, but it suffers from limited parallelism.
Do you have any recommendations/suggestions for a better implementation?
Also, are there any recommended data collections for managing the window?
(I'm simply using Arrays to manage the data.)

While working on this I started to investigate Spark Streaming.
The problem is that I don't know if it is really possible to use it on
already-collected data.
This post seems to indicate that it is, but it is not clear to me how:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html
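
From what I can tell, something like queueStream might be one way to replay
already-collected data through Spark Streaming. A minimal sketch, assuming one
file per minute of historical data (the paths and batch/window sizes are
placeholders):

    import scala.collection.mutable
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext("local[*]", "replay", Seconds(1))
    val queue = new mutable.Queue[RDD[String]]()

    // queueStream consumes one queued RDD per batch interval
    val lines = ssc.queueStream(queue, oneAtATime = true)

    // standard windowing on the replayed stream:
    // last 10 batches, sliding by 1 batch
    lines.window(Seconds(10), Seconds(1)).count().print()

    ssc.start()

    // enqueue one RDD per minute of already-collected data
    for (minuteFile <- Seq("/logs/minute-01.txt", "/logs/minute-02.txt")) {
      queue += ssc.sparkContext.textFile(minuteFile)
    }

    ssc.awaitTermination()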

Thanks