ReduceByKey but with different functions depending on key

2014-11-18 Thread jelgh
Hello everyone,

I'm new to Spark and I have the following problem:

I have this large JavaRDD<MyClass> collection, which I group by computing a
hashcode from some fields in MyClass:

JavaRDD<MyClass> collection = ...;
JavaPairRDD<Integer, Iterable<MyClass>> grouped =
collection.groupBy(...); // the group function just computes a hashcode
                         // from some fields in MyClass.

Now I want to reduce the variable grouped. However, I want to reduce it with
different functions depending on the key in the JavaPairRDD. So basically a
reduceByKey but with multiple functions.

The only solution I've come up with is filtering grouped once per reduce
function and applying each function to its filtered subset. This feels kind
of hackish, though.

Is there a better way? 
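A plain-Java sketch of that filter-per-key workaround (hypothetical names; integers stand in for MyClass, and in Spark each pass would be a filter() plus reduce() on the RDD):

```java
import java.util.List;
import java.util.Map;
import java.util.function.IntBinaryOperator;

public class FilterPerKey {
    // One filtered pass: keep only the values under the given key, then fold them.
    static int reduceKey(Map<Integer, List<Integer>> grouped, int key,
                         int identity, IntBinaryOperator op) {
        return grouped.getOrDefault(key, List.of()).stream()
                .mapToInt(Integer::intValue)
                .reduce(identity, op);
    }

    public static void main(String[] args) {
        // Stand-in for JavaPairRDD<Integer, Iterable<MyClass>>.
        Map<Integer, List<Integer>> grouped = Map.of(
                1, List.of(1, 2, 3),
                2, List.of(4, 5, 6));
        // A separate pass per key, each with its own reduce function.
        System.out.println(reduceKey(grouped, 1, 0, Integer::sum));              // 6
        System.out.println(reduceKey(grouped, 2, Integer.MIN_VALUE, Math::max)); // 6
    }
}
```

Each key needs its own pass over the data, which is why this feels hackish: the collection is scanned once per reduce function.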

Best regards,
Johannes



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Yanbo
First use groupByKey(); you get a pair RDD of (key: K, value: ArrayBuffer[V]).
Then use map() on this RDD with a function that performs different operations
depending on the key, which acts as a parameter of that function.
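A plain-Java sketch of this approach (hypothetical names, no Spark dependency; in Spark the body of reduceGrouped would be the function passed to map() after groupByKey()):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class PerKeyReduce {
    // For each (key, values) entry, pick a reduce function by key and fold with it.
    static Map<Integer, Integer> reduceGrouped(Map<Integer, List<Integer>> grouped,
                                               Map<Integer, BinaryOperator<Integer>> fnByKey,
                                               BinaryOperator<Integer> fallback) {
        Map<Integer, Integer> out = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet()) {
            BinaryOperator<Integer> fn = fnByKey.getOrDefault(e.getKey(), fallback);
            Integer acc = null;
            for (Integer v : e.getValue()) {
                acc = (acc == null) ? v : fn.apply(acc, v);
            }
            out.put(e.getKey(), acc);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, BinaryOperator<Integer>> fns = new HashMap<>();
        fns.put(1, Integer::sum);  // key 1: sum its values
        fns.put(2, Math::max);     // key 2: take the maximum
        Map<Integer, Integer> result = reduceGrouped(
                Map.of(1, List.of(1, 2, 3), 2, List.of(1, 2, 3)), fns, Integer::sum);
        System.out.println(result.get(1) + " " + result.get(2)); // 6 3
    }
}
```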


 On 18 Nov 2014, at 20:59, jelgh johannes.e...@gmail.com wrote:




Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread Debasish Das
groupByKey does not run a combiner, so be careful about the
performance...groupByKey shuffles even locally grouped values...

reduceByKey and aggregateByKey do run a combiner. If you want a separate
function for each key, you can build a key-to-closure map, broadcast it, and
use it in reduceByKey, provided you have access to the key in
reduceByKey/aggregateByKey...

I have not needed to access the key in reduceByKey/aggregateByKey yet,
but there should be a way...
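A plain-Java sketch of the key-to-closure dispatch (hypothetical names, no Spark dependency): each value is tagged with its key, so a reduce-style combine can look up the right function in the broadcast map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntBinaryOperator;

public class BroadcastDispatch {
    // A value tagged with its key, so a combiner can dispatch on the key.
    record Tagged(int key, int value) {}

    // Combine two tagged values with the function registered for their key.
    static Tagged combine(Map<Integer, IntBinaryOperator> fnByKey, Tagged a, Tagged b) {
        IntBinaryOperator fn = fnByKey.getOrDefault(a.key(), Integer::sum);
        return new Tagged(a.key(), fn.applyAsInt(a.value(), b.value()));
    }

    public static void main(String[] args) {
        // In Spark this map would be a broadcast variable captured by the closure.
        Map<Integer, IntBinaryOperator> fnByKey = new HashMap<>();
        fnByKey.put(1, Integer::sum);
        fnByKey.put(2, Math::max);
        System.out.println(combine(fnByKey, new Tagged(1, 4), new Tagged(1, 9)).value()); // 13
        System.out.println(combine(fnByKey, new Tagged(2, 4), new Tagged(2, 9)).value()); // 9
    }
}
```

As long as each per-key function is associative, this combine has the shape a combiner-friendly reduceByKey needs once the key travels inside the value.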

On Tue, Nov 18, 2014 at 7:24 AM, Yanbo yanboha...@gmail.com wrote:





Re: ReduceByKey but with different functions depending on key

2014-11-18 Thread lordjoe
Map the key/value pair into (key, Tuple2<key, value>) and process that.
Also ask the Spark maintainers for a version of the keyed operations where the
key is passed in as an argument; I run into these cases all the time.

/**
 * Map a (key, value) pair to (key, (key, value)) to ensure subsequent
 * processing has access to both key and value.
 * @param inp input pair RDD
 * @param <K> key type
 * @param <V> value type
 * @return output where the value carries both key and value
 */
@Nonnull
public static <K extends Serializable, V extends Serializable>
JavaPairRDD<K, Tuple2<K, V>> toKeyedTuples(@Nonnull JavaPairRDD<K, V> inp) {
    return inp.flatMapToPair(new PairFlatMapFunction<Tuple2<K, V>, K, Tuple2<K, V>>() {
        @Override
        public Iterable<Tuple2<K, Tuple2<K, V>>> call(final Tuple2<K, V> t) throws Exception {
            return Collections.singletonList(
                    new Tuple2<K, Tuple2<K, V>>(t._1(), new Tuple2<K, V>(t._1(), t._2())));
        }
    });
}
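For illustration, a plain-Java analogue of toKeyedTuples (hypothetical helper, no Spark dependency): each (k, v) pair becomes (k, (k, v)), so a later reduce function can read the key out of the value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KeyedTuples {
    // (k, v) -> (k, (k, v)); Map.Entry stands in for Scala's Tuple2.
    static List<Map.Entry<Integer, Map.Entry<Integer, String>>> toKeyedTuples(
            List<Map.Entry<Integer, String>> inp) {
        List<Map.Entry<Integer, Map.Entry<Integer, String>>> out = new ArrayList<>();
        for (Map.Entry<Integer, String> e : inp) {
            out.add(Map.entry(e.getKey(), Map.entry(e.getKey(), e.getValue())));
        }
        return out;
    }

    public static void main(String[] args) {
        var keyed = toKeyedTuples(List.of(Map.entry(1, "a"), Map.entry(2, "b")));
        // The key is now available both as the pair key and inside the value.
        System.out.println(keyed.get(0).getValue().getKey()); // 1
    }
}
```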



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177p19198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
