Calling collect on anything is almost always a bad idea. The only exception is if you are passing that data on to some other system and never looking at it again :) . I would say you need to implement the outlier detection on the RDD and process it in Spark itself, rather than calling collect on it.
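As a rough sketch of what "in Spark itself" could look like: the box-plot (IQR) rule only needs the quartiles of each window, so the check can run inside the cluster, e.g. within foreachRDD via a per-partition or per-key function, instead of collecting everything to the driver. The helper below is a hypothetical `boxplot_outliers` in plain Python, just to show the quartile math; wiring it into a PySpark transformation (or porting it to the Java/Scala API you are using) is left as an assumption.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Box-plot rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    `values` is one window's worth of numbers for a single column;
    inside Spark this could be applied per key/partition rather than
    on a collected list at the driver.
    """
    if len(values) < 4:
        return []  # too few points for meaningful quartiles
    q1, _median, q3 = quantiles(values, n=4)  # exclusive method by default
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```

For example, `boxplot_outliers([1, 2, 3, 4, 5, 100])` flags only `100`. The point of pushing this into a transformation is that only the (usually small) list of outliers ever leaves the executors.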
Regards,
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo <eko.harmawan.sus...@gmail.com> wrote:

> Hi All,
>
> I have a problem regarding Spark Streaming that I would like to consult
> about.
>
> I have a Spark Streaming application that parses a file (which grows over
> time). This file contains several columns of numbers, and the parsing is
> divided into windows (of 1 minute each). Each column represents a different
> entity, while each row within a column represents a value of that entity
> (for example, the first column represents temperature, the second column
> represents humidity, etc.). I use a PairDStream for each column.
>
> Afterwards, I need to run a time-consuming algorithm (outlier detection;
> for now I use the box plot algorithm) on each RDD of each PairDStream.
>
> To run the outlier detection, I am currently thinking of calling collect
> on each PairDStream from foreachRDD, getting the list of items, and then
> passing each list to a thread. Each thread runs the outlier detection
> algorithm and processes the result.
>
> I run the outlier detection in a separate thread so as not to put too much
> burden on the Spark Streaming task. So I would like to ask: does this model
> carry any risk? Or are there alternatives provided by the framework so that
> I don't have to run a separate thread for this?
>
> Thank you for your attention.
>
> --
> Best Regards,
> Eko Susilo