Multiple Filter Efficiency

2014-12-16 Thread zkidkid
Hi,
Currently I am trying to count lines in a document against multiple filters.
Let's say this is my document:

//user field1 field2 field3
user1 0 0 1
user2 0 1 0
user3 0 0 0

I want to count the lines of user.log that match each of several filters, like this:

Filter1: field1 == 0 && field2 == 0
Filter2: field1 == 0 && field3 == 1
Filter3: field1 == 0 && field3 == 0
...
and also the total line count.

I have tried, and I found that I couldn't use group-by or a plain
map-then-reduce, because a single line can match two or more filters.

My idea now is to foreach over every line and maintain an outside counter
service.

For example:

JavaRDD<String> textFile = sc.textFile(hdfs, 10);
long start = System.currentTimeMillis();

textFile.foreach(new VoidFunction<String>() {
    public void call(String s) {
        // a line can match several filters, so check them all
        for (MyFilter filter : MyFilters) {
            if (filter.match(s)) filter.increaseOwnCounter();
        }
    }
});


I would be happy if there is another way to do it; any help is appreciated.
Thanks in advance.



Re: Multiple Filter Efficiency

2014-12-16 Thread Imran Rashid
I think accumulators do exactly what you want.

(Scala syntax below, I'm just not familiar with the Java equivalent ...)

val f1counts = sc.accumulator(0)
val f2counts = sc.accumulator(0)
val f3counts = sc.accumulator(0)

textfile.foreach { s =>
  if (f1matches(s)) f1counts += 1
  ...
}
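
If I had to guess at the Java equivalent, it would be roughly this (an
untested sketch against the Spark 1.x Java API; f1matches/f2matches are
hypothetical stand-ins for whatever your filter checks are):

import org.apache.spark.Accumulator;
import org.apache.spark.api.java.function.VoidFunction;

// one accumulator per filter, created on the driver
final Accumulator<Integer> f1counts = sc.accumulator(0);
final Accumulator<Integer> f2counts = sc.accumulator(0);

textFile.foreach(new VoidFunction<String>() {
    public void call(String s) {
        // a line may match several filters, so no else-if here
        if (f1matches(s)) f1counts.add(1);
        if (f2matches(s)) f2counts.add(1);
    }
});

// read the totals back on the driver once the action has finished
int f1total = f1counts.value();

Since foreach is an action, Spark applies each task's accumulator updates
exactly once even if a task gets retried.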

Note that you could also do a normal map-reduce even though a record might
match more than one filter. In the Scala API you can use flatMap to output
zero or more records per input line:

textfile.flatMap { s =>
  Seq(
    (if (f1matches(s)) Some("f1" -> 1) else None),
    ...
  ).flatten
}.reduceByKey { _ + _ }
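
And a possible Java counterpart to the flatMap version (again an untested
sketch against the Spark 1.x Java API, where PairFlatMapFunction returns an
Iterable of tuples; f1matches/f2matches remain hypothetical filter checks):

import java.util.ArrayList;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;

JavaPairRDD<String, Integer> counts = textFile.flatMapToPair(
    new PairFlatMapFunction<String, String, Integer>() {
        public Iterable<Tuple2<String, Integer>> call(String s) {
            // emit one (filterName, 1) record per matching filter
            List<Tuple2<String, Integer>> out =
                new ArrayList<Tuple2<String, Integer>>();
            if (f1matches(s)) out.add(new Tuple2<String, Integer>("f1", 1));
            if (f2matches(s)) out.add(new Tuple2<String, Integer>("f2", 1));
            return out;
        }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
    });

// counts.collectAsMap() then gives a filterName -> count map on the driver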