Re: window analysis with Spark and Spark streaming
Hi,

For QueueRDD, have a look here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

Regards,
Laeeq, PhD candidate, KTH, Stockholm.

On Sunday, July 6, 2014 10:20 AM, alessandro finamore wrote:
> I see. I'll try to implement this solution as well so that I can compare it
> with my current Spark implementation. [...]
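For reference, a condensed sketch along the lines of that QueueStream example: RDDs are pushed into a queue spaced by wall-clock time, so each one-second batch picks up whatever was queued in that second.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // The stream draws from this queue of RDDs, one per batch interval
    val rddQueue = new mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(rddQueue)
    stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()

    ssc.start()

    // Push one RDD per second, simulating application time
    for (i <- 1 to 30) {
      rddQueue.synchronized {
        rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
      }
      Thread.sleep(1000)
    }
    ssc.stop()
  }
}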
Re: window analysis with Spark and Spark streaming
On 5 July 2014 23:08, Mayur Rustagi [via Apache Spark User List] <[hidden email]> wrote:
> Key idea is to simulate your app time as you enter data. So you can connect
> Spark Streaming to a queue and insert data into it spaced by time. Easier
> said than done :).

I see. I'll try to implement this solution as well so that I can compare it
with my current Spark implementation. I'm interested in seeing whether it is
faster... as I assume it should be :)

> What are the parallelism issues you are hitting with your static approach?

In my current Spark implementation, whenever I need the aggregated stats over
the window, I re-map all the current bins to the same key so that they can be
reduced together. This means the data needs to be shipped to a single reducer.
As a result, adding nodes/cores to the application does not really affect the
total time :(

--
Alessandro Finamore, PhD
Politecnico di Torino
--
Office: +39 0115644127
Mobile: +39 3280251485
SkypeId: alessandro.finamore
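A minimal sketch of the pattern described above (names hypothetical): mapping every per-minute bin onto one shared key funnels the whole window through a single reduce task, which is why extra nodes/cores don't shorten the total time.

import org.apache.spark.SparkContext._ // pair-RDD operations (Spark 1.x)
import org.apache.spark.rdd.RDD

// Hypothetical per-minute bins: (minuteIndex, perMinuteStat)
def windowTotal(bins: RDD[(Int, Long)]): Long = {
  bins
    .map { case (_, stat) => ("window", stat) } // every record gets the same key...
    .reduceByKey(_ + _)                         // ...so one reduce task merges them all
    .values
    .first()
}

Note that reduceByKey still combines map-side, so only per-partition partials are shuffled to that one task; a plain bins.values.reduce(_ + _) would merge the per-partition partials on the driver with no shuffle at all.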
Re: window analysis with Spark and Spark streaming
Key idea is to simulate your app time as you enter data. So you can connect
Spark Streaming to a queue and insert data into it spaced by time. Easier said
than done :).

What are the parallelism issues you are hitting with your static approach?

On Friday, July 4, 2014, alessandro finamore wrote:
> Thanks for the replies.
>
> What is not completely clear to me is how time is managed. [...]

--
Sent from Gmail Mobile
Re: window analysis with Spark and Spark streaming
The windowing capabilities of Spark Streaming determine the events that go
into the RDD created for each time window. If the batch duration is 1s, then
all the events received in a particular 1s window will be part of the RDD
created for that window of that stream.

On Friday, July 4, 2014 1:28 PM, alessandro finamore wrote:
> Thanks for the replies.
>
> What is not completely clear to me is how time is managed. [...]
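As an illustration (source and sizes hypothetical), time-based windowing over those 1s batches looks like this; each windowed RDD covers the events from the last 60 seconds, recomputed every 10 seconds:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowSketch")
val ssc = new StreamingContext(conf, Seconds(1)) // 1s batch duration

val lines = ssc.socketTextStream("localhost", 9999) // hypothetical source
val windowed = lines.window(Seconds(60), Seconds(10))
windowed.count().print()

ssc.start()
ssc.awaitTermination()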
Re: window analysis with Spark and Spark streaming
Thanks for the replies.

What is not completely clear to me is how time is managed.
I can create a DStream from files, but if I set the window property, it will
be bound to the application time, right?

If I got it right, with a receiver I can control the way DStreams are created.
But how can I then apply the windowing already shipped with the framework if
it is bound to the "application time"? I would like to define a window of N
files, but the window() function requires a duration as input...
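One way to square "a window of N files" with the duration-based window() API (a sketch under assumptions, not something proposed in the thread so far): if exactly one file's RDD is consumed per batch, a window of N batch intervals is a window of N files. A queue stream with oneAtATime = true gives that one-file-per-batch pacing:

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Assumes `ssc` uses a 1-second batch interval and `fileRDDs` holds one RDD
// per already-collected file, in order.
def lastNFiles(ssc: StreamingContext,
               fileRDDs: mutable.Queue[RDD[String]],
               n: Int): DStream[String] = {
  // oneAtATime = true: exactly one queued RDD is consumed per batch,
  // so one batch corresponds to one file
  val perFile = ssc.queueStream(fileRDDs, oneAtATime = true)
  // A window of n batch intervals then covers the last n files
  perFile.window(Seconds(n), Seconds(1))
}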
Re: window analysis with Spark and Spark streaming
Another alternative could be to use Spark Streaming's textFileStream with its
windowing capabilities.

On Friday, July 4, 2014 9:52 AM, Gianluca Privitera wrote:
> You should think about a custom receiver, in order to solve the problem of
> the "already collected" data.
> http://spark.apache.org/docs/latest/streaming-custom-receivers.html
>
> Gianluca

On 04 Jul 2014, at 15:46, alessandro finamore wrote:
> Hi,
>
> I have a large dataset of text log files on which I need to implement
> "window analysis". [...]
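A sketch of that alternative (directory path and window sizes hypothetical). One caveat: textFileStream only picks up files as they newly appear in the monitored directory, so already-collected files would have to be moved into it over time:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("LogWindowSketch")
val ssc = new StreamingContext(conf, Seconds(60)) // one batch per minute

// Monitors the directory and turns each new file into part of a batch
val logs = ssc.textFileStream("hdfs:///logs/incoming") // hypothetical path

// Aggregated stats over the last 10 minutes, sliding every minute
val lastTen = logs.window(Minutes(10), Minutes(1))
lastTen.count().print()

ssc.start()
ssc.awaitTermination()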
Re: window analysis with Spark and Spark streaming
You should think about a custom receiver, in order to solve the problem of
the "already collected" data.
http://spark.apache.org/docs/latest/streaming-custom-receivers.html

Gianluca

On 04 Jul 2014, at 15:46, alessandro finamore <alessandro.finam...@polito.it> wrote:

> Hi,
>
> I have a large dataset of text log files on which I need to implement
> "window analysis": say, extract per-minute data and do aggregated stats on
> the last X minutes.
>
> I have to implement the windowing analysis with Spark. This is the workflow
> I'm currently using:
> - read a file and create a new RDD with per-minute info
> - loop on each new minute and integrate its data with another data structure
>   containing the last X minutes of data
> - apply the analysis on the updated window of data
>
> This works, but it suffers from limited parallelism.
> Do you have any recommendations/suggestions for a better implementation?
> Also, are there any recommended data collections for managing the window?
> (I'm simply using Arrays for managing data.)
>
> While working on this I started to investigate Spark Streaming.
> The problem is that I don't know if it is really possible to use it on
> already collected data. This post seems to indicate that it should be, but
> it is not clear to me how:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-windowing-Driven-by-absolutely-time-td1733.html
>
> Thanks
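Following the guide linked above, a minimal custom-receiver sketch for replaying already-collected files (class name, pacing, and file handling are assumptions, not from the thread):

import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Replays existing log files, pausing between them to emulate arrival time
class ReplayFileReceiver(files: Seq[String], pauseMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  override def onStart(): Unit = {
    // Replay on a separate thread so onStart() returns immediately
    new Thread("file-replay") {
      override def run(): Unit = replay()
    }.start()
  }

  override def onStop(): Unit = { /* the replay loop checks isStopped() */ }

  private def replay(): Unit = {
    for (path <- files if !isStopped()) {
      val source = Source.fromFile(path)
      try {
        source.getLines().foreach(line => store(line)) // hand each line to Spark Streaming
      } finally {
        source.close()
      }
      Thread.sleep(pauseMs) // space files out to simulate application time
    }
  }
}

It would be wired in with something like ssc.receiverStream(new ReplayFileReceiver(paths, 60000L)), after which the usual window() operations apply.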