Re: Spark Streaming for time consuming job

2014-10-02 Thread Eko Susilo
Hi Mayur,

Thanks for your suggestion.

In fact, that's what I'm thinking about: to process the data in Spark and return only the
percentage of outliers in a particular window.

I also have some doubts about implementing the outlier detection on the RDD as
you have suggested.

From what I understand, those RDDs are distributed among the Spark workers,
so I imagine I would do the following (code_905 is a JavaPairDStream<String, Long>):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        if (pair.count() > 0) {
            final List<Double> data = new LinkedList<Double>();
            pair.foreach(new VoidFunction<Tuple2<String, Long>>() {
                @Override
                public void call(Tuple2<String, Long> t) throws Exception {
                    double doubleValue = t._2().doubleValue();
                    // register data from this window to be checked
                    data.add(doubleValue);
                    // register the data to the outlier detector
                    outlierDetector.addData(doubleValue);
                }
            });
            // get the percentage of outliers for this window
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});

The variable outlierDetector is declared as a static class variable. The call
to outlierDetector.addData is needed because I would like to run the outlier
detection on data accumulated from previous window(s) as well.
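
Now that I write it out, I realise I am not sure the snippet above does what
I intend: as far as I understand, the function passed to foreach is shipped to
the workers, so the data list (and the static outlierDetector) updated there
would not be the instances the driver sees. The following untested sketch of
my intent instead pulls the values back to the driver (against your advice
about collect, I know, but only one window's worth of values):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        // collect() brings this window's values to the driver, so the
        // static outlierDetector below is the single driver-side instance
        List<Long> values = pair.values().collect();
        if (!values.isEmpty()) {
            List<Double> data = new LinkedList<Double>();
            for (Long v : values) {
                double d = v.doubleValue();
                data.add(d);                // data from this window to be checked
                outlierDetector.addData(d); // accumulate across windows
            }
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});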

My concern with writing the outlier detection in Spark is that it could slow
down the streaming job, since the detection involves sorting the data and
computing statistics. In particular, I would need to run many instances of
the outlier detection (each instance handling a different set of data).
So, what do you think about this model?
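
For comparison, here is roughly how I picture doing the box plot step in
Spark itself, as you suggest (pair is the window's JavaPairRDD<String, Long>
as above). This is only an untested sketch: it assumes at least four values
per window, takes the quartiles without interpolation, and sorts into a
single partition for simplicity:

JavaRDD<Double> values = pair.values().map(new Function<Long, Double>() {
    @Override
    public Double call(Long v) { return v.doubleValue(); }
});
// sort the window's values; one partition keeps the sketch simple
JavaRDD<Double> sorted = values.sortBy(new Function<Double, Double>() {
    @Override
    public Double call(Double d) { return d; }
}, true, 1);
final long n = sorted.count();
// rank each value, then pull out Q1 and Q3 (indices simplified, assumes n >= 4)
List<Double> qs = sorted.zipWithIndex()
    .filter(new Function<Tuple2<Double, Long>, Boolean>() {
        @Override
        public Boolean call(Tuple2<Double, Long> t) {
            return t._2() == n / 4 || t._2() == (3 * n) / 4;
        }
    })
    .keys().collect(); // only two numbers come back to the driver
double q1 = qs.get(0), q3 = qs.get(1);
final double lower = q1 - 1.5 * (q3 - q1);
final double upper = q3 + 1.5 * (q3 - q1);
// count values outside the box plot whiskers
long outliers = values.filter(new Function<Double, Boolean>() {
    @Override
    public Boolean call(Double d) { return d < lower || d > upper; }
}).count();
double percentage = 100.0 * outliers / n;
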






On Wed, Oct 1, 2014 at 1:59 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:

 Calling collect on anything is almost always a bad idea. The only
 exception is if you are looking to pass that data on to any other system and
 never see it again :).
 I would say you need to implement outlier detection on the RDD and process
 it in Spark itself rather than calling collect on it.

 Regards
 Mayur

 Mayur Rustagi
 Ph: +1 (760) 203 3257
 http://www.sigmoidanalytics.com
 @mayur_rustagi https://twitter.com/mayur_rustagi


 On Tue, Sep 30, 2014 at 3:22 PM, Eko Susilo eko.harmawan.sus...@gmail.com
  wrote:

 Hi All,

 I have a problem regarding Spark Streaming that I would like to consult you about.

 I have a Spark Streaming application that parses a file (which keeps
 growing as time passes). The file contains several columns of numbers,
 and the parsing is divided into windows (1 minute each). Each column
 represents a different entity, while the rows within a column are
 successive values of that same entity (for example, the first column
 represents temperature, the second column represents humidity, etc.,
 and each row holds a reading of that attribute). I use a PairDStream
 for each column.

 Afterwards, I need to run a time-consuming algorithm (outlier detection;
 for now I use the box plot method) on each RDD of each PairDStream.

 To run the outlier detection, I am currently thinking about calling
 collect on each PairDStream from foreachRDD to get the list of items,
 and then passing each list of items to a thread. Each thread runs the
 outlier detection algorithm and processes the result.

 I run the outlier detection in a separate thread in order not to put too
 much load on the Spark Streaming task, roughly as sketched below. So, I
 would like to ask: does this model carry any risk? Or is there an
 alternative provided by the framework so that I don't have to run a
 separate thread for this?
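
 Roughly, I imagine it like this (an untested sketch; column stands for one
 of the per-column pair DStreams, the pool size is a guess, and
 runOutlierDetection is a placeholder for the box plot step):

 final ExecutorService pool = Executors.newFixedThreadPool(4);
 column.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
     @Override
     public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
         // bring this window's items to the driver...
         final List<Tuple2<String, Long>> items = pair.collect();
         // ...and hand them to a background thread so the streaming job
         // is not blocked by the detection
         pool.submit(new Runnable() {
             @Override
             public void run() {
                 runOutlierDetection(items);
             }
         });
         return null;
     }
 });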

 Thank you for your attention.



 --
 Best Regards,
 Eko Susilo





-- 
Best Regards,
Eko Susilo

