Hi all, I have a question about Spark Streaming that I would like your advice on.
I have a Spark Streaming application that parses a file which grows over time. The file contains several columns of numbers, and the parsing is divided into windows of 1 minute each. Each column represents a different attribute, and each row within a column holds one value of that attribute (for example, the first column is temperature, the second is humidity, and so on). I use a PairDStream for each column.

Afterwards, I need to run a time-consuming algorithm (outlier detection; for now I use the box plot algorithm) on each RDD of each PairDStream. To do this, I am currently thinking of calling collect on each RDD from within foreachRDD to get the list of items, and then passing each list to a thread. Each thread runs the outlier detection algorithm and processes the result. I run the outlier detection in a separate thread so as not to put too much load on the Spark Streaming task.

So I would like to ask: does this model carry any risk? Or does the framework provide an alternative, so that I don't have to run a separate thread for this?

Thank you for your attention.

-- 
Best Regards,
Eko Susilo
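
For reference, here is a minimal sketch of a box plot (IQR) outlier check on one window of values. It assumes the common 1.5 x IQR fences and quartiles computed by linear interpolation, which may not be the exact variant used in the application. The class name BoxPlotOutliers and the methods quantile and outliers are hypothetical, and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BoxPlotOutliers {

    // Quantile by linear interpolation over a sorted array
    // (an assumption: other quartile conventions exist).
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns the values lying outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    static List<Double> outliers(List<Double> values) {
        List<Double> result = new ArrayList<>();
        if (values.isEmpty()) {
            return result;
        }
        double[] sorted = values.stream()
                .mapToDouble(Double::doubleValue)
                .sorted()
                .toArray();
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;
        double upper = q3 + 1.5 * iqr;
        // Keep the original order of the window when reporting outliers.
        for (double v : values) {
            if (v < lower || v > upper) {
                result.add(v);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical one-minute window of temperature readings.
        List<Double> window = Arrays.asList(21.0, 22.0, 21.5, 22.5, 21.8, 45.0);
        System.out.println(outliers(window)); // prints [45.0]
    }
}
```

Because the check works on a plain list, it could be applied either to a collected window in the driver or inside a transformation that runs on the executors.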