Hi Mayur,
Thanks for your suggestion.
In fact, that's what I'm thinking about: to pass that data and return only the
percentage of outliers in a particular window.
I also have some doubts about how I would implement the outlier detection on
an RDD as you suggested.
From what I understand, those RDDs are distributed among the Spark workers,
so I imagine I would do something like the following (code_905 is a PairDStream):
code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        if (pair.count() > 0) {
            final List<Double> data = new LinkedList<Double>();
            pair.foreach(new VoidFunction<Tuple2<String, Long>>() {
                @Override
                public void call(Tuple2<String, Long> t) throws Exception {
                    double doubleValue = t._2.doubleValue();
                    // register data from this window to be checked
                    data.add(doubleValue);
                    // register the data to the outlier detector
                    outlierDetector.addData(doubleValue);
                }
            });
            // get percentage of the outliers for this window
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});
The variable outlierDetector is declared as a static class variable. The
call to outlierDetector.addData is needed because I would like to run the
outlier detection using data obtained from previous window(s).
My concern with writing the outlier detection in Spark is that it would slow
down Spark Streaming, since the outlier detection involves sorting the data
and calculating some statistics. In particular, I would need to run many
instances of the outlier detection (each instance handling a different set
of data). So, what do you think about this model?
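To make the model concrete, here is a minimal, self-contained sketch of the kind of box-plot (IQR) detector I have in mind. The class and method names mirror the ones used above, but the implementation details (the 1.5 * IQR fences, the interpolated percentile) are just the usual convention, not my final code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch of a box-plot (IQR) outlier detector.
// addData() accumulates values across windows; the percentage for a given
// window is computed against fences derived from all data seen so far.
class OutlierDetector {
    private final List<Double> history = new ArrayList<Double>();

    public synchronized void addData(double value) {
        history.add(value);
    }

    public synchronized double getOutlierPercentageFromThisData(List<Double> window) {
        if (history.isEmpty() || window.isEmpty()) return 0.0;
        List<Double> sorted = new ArrayList<Double>(history);
        Collections.sort(sorted);
        double q1 = percentile(sorted, 0.25);
        double q3 = percentile(sorted, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;  // standard box-plot fences
        double upper = q3 + 1.5 * iqr;
        int outliers = 0;
        for (double v : window) {
            if (v < lower || v > upper) outliers++;
        }
        return 100.0 * outliers / window.size();
    }

    // Simple linear-interpolation percentile on a sorted list.
    private static double percentile(List<Double> sorted, double p) {
        double pos = p * (sorted.size() - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        double frac = pos - lo;
        return sorted.get(lo) * (1 - frac) + sorted.get(hi) * frac;
    }
}
```

The methods are synchronized because, in my model, several windows (and possibly several threads) would be feeding the same detector instance.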
On Wed, Oct 1, 2014 at 1:59 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
Calling collect on anything is almost always a bad idea. The only
exception is if you are looking to pass that data on to some other system
and never see it again :) .
I would say you need to implement the outlier detection on the RDD and
process it in Spark itself rather than calling collect on it.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo eko.harmawan.sus...@gmail.com
wrote:
Hi All,
I have a problem regarding Spark Streaming that I would like to consult about.
I have a Spark Streaming application that parses a file (which keeps growing
as time passes by). This file contains several columns, each containing
lines of numbers, and the parsing is divided into windows (of 1 minute each).
Each column represents a different entity, while each row within a column
represents the same entity (for example, the first column represents
temperature, the second column represents humidity, etc., while each row
represents the value of that attribute). I use a PairDStream for each column.
Afterwards, I need to run a time-consuming algorithm (outlier detection;
for now I use a box-plot algorithm) on each RDD of each PairDStream.
To run the outlier detection, I am currently thinking of calling
collect on each of the PairDStreams from the forEachRDD method; I then get
the list of items and pass each list of items to a thread.
Each thread runs the outlier detection algorithm and processes the result.
I run the outlier detection in a separate thread in order not to put too
much burden on the Spark Streaming task. So, I would like to ask: does this
model carry any risks? Or is there any alternative provided by the framework
so that I don't have to run a separate thread for this?
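Roughly, the hand-off I have in mind looks like the sketch below. The dispatcher and its threshold-based detection step are placeholders standing in for my real outlier-detection code; the point is only how each collected window would be passed to a worker thread:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: each collected window is handed to a worker thread so the
// streaming job itself is not blocked by the slow outlier detection.
class WindowDispatcher {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Submit one window's values; the Callable body is a placeholder
    // (percentage of values above a fixed threshold) standing in for
    // the real outlier-detection algorithm.
    public Future<Double> submitWindow(final List<Double> items, final double threshold) {
        return pool.submit(new Callable<Double>() {
            public Double call() {
                int above = 0;
                for (double v : items) {
                    if (v > threshold) above++;
                }
                return items.isEmpty() ? 0.0 : 100.0 * above / items.size();
            }
        });
    }

    public void shutdown() {
        pool.shutdown();
    }
}
```

The Future lets the caller decide whether to wait for the result or fire and forget, which is part of what I am unsure about.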
Thank you for your attention.
--
Best Regards,
Eko Susilo
--
Best Regards,
Eko Susilo