Hi Mayur,

Thanks for your suggestion.

In fact, that's what I'm thinking about: passing that data along and
returning only the percentage of outliers in a particular window.

I also have some doubts about how I would implement the outlier detection
on the RDD as you suggested.

From what I understand, those RDDs are distributed among the Spark
workers, so I imagine I would do something like the following (code_905 is
a PairDStream):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        if (pair.count() > 0) {
            final List<Double> data = new LinkedList<Double>();
            pair.foreach(new VoidFunction<Tuple2<String, Long>>() {
                @Override
                public void call(Tuple2<String, Long> t) throws Exception {
                    double doubleValue = t._2.doubleValue();
                    // register data from this window to be checked
                    data.add(doubleValue);
                    // register the data with the outlier detector
                    outlierDetector.addData(doubleValue);
                }
            });
            // get the percentage of outliers for this window
            // (I am not sure the values added to `data` inside foreach are
            // actually visible here on the driver, since the closure runs on
            // the workers; this is part of my doubt above)
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});

The variable outlierDetector is declared as a static class variable. The
call to "outlierDetector.addData" is needed because I would like to run the
outlier detection on data obtained from previous window(s) as well.
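For reference, this is roughly the detector I have in mind behind addData
and getOutlierPercentageFromThisData: just a minimal, single-threaded
sketch of the box plot (IQR) rule, not the actual implementation. The class
and method names simply mirror the snippet above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of the box-plot (IQR) outlier detector.
// Thread safety is only roughly handled via synchronized methods.
class OutlierDetector {
    // all values seen so far, across windows
    private final List<Double> history = new ArrayList<Double>();

    // accumulate a value from the current window
    public synchronized void addData(double value) {
        history.add(value);
    }

    // percentage of values in `data` that fall outside the box-plot fences
    // (Q1 - 1.5*IQR, Q3 + 1.5*IQR) computed from all data seen so far
    public synchronized double getOutlierPercentageFromThisData(List<Double> data) {
        if (history.size() < 4 || data.isEmpty()) {
            return 0.0; // not enough history to form meaningful quartiles
        }
        List<Double> sorted = new ArrayList<Double>(history);
        Collections.sort(sorted);
        double q1 = quartile(sorted, 0.25);
        double q3 = quartile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;
        double upper = q3 + 1.5 * iqr;
        int outliers = 0;
        for (double v : data) {
            if (v < lower || v > upper) {
                outliers++;
            }
        }
        return 100.0 * outliers / data.size();
    }

    // simple nearest-rank quartile; good enough for a sketch
    private static double quartile(List<Double> sorted, double q) {
        int index = (int) Math.ceil(q * sorted.size()) - 1;
        if (index < 0) {
            index = 0;
        }
        return sorted.get(index);
    }
}
```

The sort is where the per-window cost comes from, which is the concern I
describe below.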

My concern with writing the outlier detection on Spark is that it would
slow down the Spark Streaming job, since the outlier detection involves
sorting the data and calculating some statistics. In particular, I would
need to run many instances of the outlier detector (each instance handling
a different set of data). So, what do you think about this model?

On Wed, Oct 1, 2014 at 1:59 PM, Mayur Rustagi <mayur.rust...@gmail.com>
wrote:

> Calling collect on anything  is almost always a bad idea. The only
> exception is if you are looking to pass that data on to any other system &
> never see it again :) .
> I would say you need to implement outlier detection on the rdd & process
> it in spark itself rather than calling collect on it.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
> On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo <eko.harmawan.sus...@gmail.com
> > wrote:
>
>> Hi All,
>>
>> I have a problem that i would like to consult about spark streaming.
>>
>> I have a Spark Streaming application that parses a file (which keeps
>> growing as time passes by). This file contains several columns of
>> numbers, and the parsing is divided into windows (1 minute each). Each
>> column represents a different attribute, while each row within a column
>> represents a value of that attribute (for example, the first column
>> represents temperature, the second column humidity, etc.). I use a
>> PairDStream for each column.
>>
>> Afterwards, I need to run a time-consuming algorithm (outlier detection;
>> for now I use the box plot algorithm) on each RDD of each PairDStream.
>>
>> To run the outlier detection, I am currently thinking about calling
>> collect on each of the PairDStreams from the method foreachRDD, getting
>> the list of items, and then passing each list of items to a thread.
>> Each thread runs the outlier detection algorithm and processes the result.
>>
>> I run the outlier detection in a separate thread in order not to put too
>> much burden on the Spark Streaming task. So, I would like to ask: does
>> this model carry any risk? Or is there an alternative provided by the
>> framework so that I don't have to run a separate thread for this?
>>
>> Thank you for your attention.
>>
>>
>>
>> --
>> Best Regards,
>> Eko Susilo
>>
>
>


-- 
Best Regards,
Eko Susilo
