Hi All,

I have a question about Spark Streaming that I would like some advice on.

I have a Spark Streaming application that parses a file, which keeps
growing over time. The file contains several columns of numbers, and the
parsing is divided into windows of 1 minute each. Each column represents a
different attribute, while every row within a column holds a value of that
same attribute (for example, the first column represents temperature, the
second column represents humidity, etc., and each row holds one value of
each attribute). I use a PairDStream for each column.

Afterwards, I need to run a time-consuming algorithm (outlier detection;
for now I use the box plot algorithm) on each RDD of each PairDStream.
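
To make the box plot step concrete, here is a minimal sketch in plain Java
(no Spark involved). It assumes the common 1.5 * IQR fences and quartiles
computed by linear interpolation, which is only one of several conventions;
the class and method names are hypothetical and not taken from my actual
code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class BoxPlotOutliers {

    // Quantile by linear interpolation on an already sorted list
    // (assumption: one of several common quartile conventions).
    static double quantile(List<Double> sorted, double q) {
        double pos = q * (sorted.size() - 1);
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        double fraction = pos - lower;
        return sorted.get(lower) + fraction * (sorted.get(upper) - sorted.get(lower));
    }

    // Returns the values lying outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
    static List<Double> findOutliers(List<Double> values) {
        List<Double> outliers = new ArrayList<>();
        if (values.size() < 4) {
            return outliers; // too few points to estimate quartiles meaningfully
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        double q1 = quantile(sorted, 0.25);
        double q3 = quantile(sorted, 0.75);
        double iqr = q3 - q1;
        double lowerFence = q1 - 1.5 * iqr;
        double upperFence = q3 + 1.5 * iqr;
        for (double v : values) {
            if (v < lowerFence || v > upperFence) {
                outliers.add(v);
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        // Made-up temperature readings for one window, for illustration only.
        List<Double> temperatures =
                Arrays.asList(20.1, 20.4, 19.8, 20.0, 35.7, 20.3, 19.9);
        System.out.println(findOutliers(temperatures));
    }
}
```

The sort dominates the cost, so the work per window grows roughly as
n log n in the number of values in that column.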

To run the outlier detection, I am currently thinking of calling collect
on each RDD of each PairDStream inside foreachRDD to get a List of the
items, and then passing each list to a thread. Each thread runs the
outlier detection algorithm and processes the result.
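
The hand-off I have in mind looks roughly like the sketch below. It is
plain Java and leaves Spark out: the List passed to submit stands in for
what collect would return, and the class name, the pool size and the
threshold detector in main are all hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

class ColumnWorkerPool {
    // Fixed-size pool: slow batches queue up instead of spawning unlimited threads.
    private final ExecutorService pool;
    private final Function<List<Double>, List<Double>> detector;

    ColumnWorkerPool(int threads, Function<List<Double>, List<Double>> detector) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.detector = detector;
    }

    // Called once per column and window with the already collected values.
    Future<List<Double>> submit(List<Double> collectedValues) {
        // Defensive copy so the worker never sees later changes to the input list.
        List<Double> snapshot = new ArrayList<>(collectedValues);
        Callable<List<Double>> task = () -> detector.apply(snapshot);
        return pool.submit(task);
    }

    // Stops accepting work and waits up to one minute for queued tasks.
    void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder detector: flags values above a fixed threshold.
        ColumnWorkerPool workers = new ColumnWorkerPool(2, values -> {
            List<Double> flagged = new ArrayList<>();
            for (double v : values) {
                if (v > 30.0) {
                    flagged.add(v);
                }
            }
            return flagged;
        });
        Future<List<Double>> result =
                workers.submit(Arrays.asList(20.0, 35.7, 19.9));
        System.out.println(result.get()); // blocks until the worker finishes
        workers.shutdown();
    }
}
```

With this shape, every window's data passes through the driver JVM, and
the worker queue can grow without bound if detection takes longer than one
window.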

I run the outlier detection in a separate thread so as not to put too much
burden on the Spark Streaming task. So I would like to ask: does this model
carry any risk? Or does the framework provide an alternative, so that I
don't have to run a separate thread for this?

Thank you for your attention.



-- 
Best Regards,
Eko Susilo
