ReduceByKey but with different functions depending on key
Hello everyone,

I'm new to Spark and I have the following problem: I have a large JavaRDD<MyClass> collection, which I group by computing a hash code from some fields of MyClass:

    JavaRDD<MyClass> collection = ...;
    JavaPairRDD<Integer, Iterable<MyClass>> grouped = collection.groupBy(...); // the group function just creates a hash code from some fields in MyClass

Now I want to reduce the variable grouped, but with a different function depending on the key in the JavaPairRDD. So basically a reduceByKey with multiple functions.

The only solution I've come up with is to filter grouped once per reduce function and apply each function to its filtered subset. This feels kind of hackish, though. Is there a better way?

Best regards,
Johannes

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
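The filter-per-function workaround described above can be sketched outside Spark with plain Java streams; the keys, values, and the two reduce functions here are hypothetical placeholders, not anything from the original post:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class FilterPerKey {
    // The "hackish" approach: filter the keyed data once per reduce
    // function, then reduce each filtered subset separately.
    static int reduceForKey(List<int[]> pairs, int key, BinaryOperator<Integer> fn) {
        return pairs.stream()
                .filter(p -> p[0] == key)   // one pass over the data per key
                .map(p -> p[1])
                .reduce(fn)
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        // Hypothetical data: (key, value) pairs; key 0 should be summed,
        // key 1 should keep its maximum.
        List<int[]> pairs = Arrays.asList(
                new int[]{0, 3}, new int[]{0, 4}, new int[]{1, 5}, new int[]{1, 9});
        System.out.println(reduceForKey(pairs, 0, Integer::sum)); // 7
        System.out.println(reduceForKey(pairs, 1, Math::max));    // 9
    }
}
```

The drawback the poster senses is visible here: each reduce function forces its own pass over the full data set.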
Re: ReduceByKey but with different functions depending on key
First use groupByKey(); you get a pair RDD of (key: K, values: ArrayBuffer[V]). Then call map() on this RDD with a function that performs different operations depending on the key, which it receives as a parameter.

On Nov 18, 2014, at 8:59 PM, jelgh <johannes.e...@gmail.com> wrote:
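Outside Spark, the shape of this groupByKey-then-map approach can be sketched with plain Java collections; the keys and per-key operations below are hypothetical placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupThenMap {
    // The map() step after groupByKey(): the function sees the key and
    // dispatches to a different reduction depending on it.
    static Map<Integer, Integer> reducePerKey(Map<Integer, List<Integer>> grouped) {
        Map<Integer, Integer> reduced = new HashMap<>();
        grouped.forEach((key, values) -> {
            int result = (key == 0)
                    ? values.stream().mapToInt(Integer::intValue).sum()             // key 0: sum
                    : values.stream().mapToInt(Integer::intValue).max().getAsInt(); // others: max
            reduced.put(key, result);
        });
        return reduced;
    }

    public static void main(String[] args) {
        // Simulated result of groupByKey(): key -> all values for that key.
        Map<Integer, List<Integer>> grouped = new HashMap<>();
        grouped.put(0, Arrays.asList(3, 4));
        grouped.put(1, Arrays.asList(5, 9));
        System.out.println(reducePerKey(grouped)); // {0=7, 1=9}
    }
}
```

Unlike the filter-per-function workaround, this makes a single pass over the grouped data.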
Re: ReduceByKey but with different functions depending on key
groupByKey does not run a combiner, so be careful about performance: it shuffles even values that are already grouped locally. reduceByKey and aggregateByKey do run a combiner. If you want a separate function for each key, you can build a key-to-closure map, broadcast it, and look the closure up inside reduceByKey, provided you can access the key there. I haven't needed to access the key inside reduceByKey/aggregateByKey yet, but there should be a way.

On Tue, Nov 18, 2014 at 7:24 AM, Yanbo <yanboha...@gmail.com> wrote:
> First use groupByKey(); you get a pair RDD of (key: K, values: ArrayBuffer[V]). Then call map() on this RDD with a function that performs different operations depending on the key.
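Stripped of Spark, the key-to-closure-map idea might look like this sketch, where the combiner for each incoming pair is picked from a lookup table; the keys and functions are hypothetical placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class KeyedCombiners {
    // reduceByKey-style fold where the combiner is looked up per key.
    // In Spark, the combiners map would be broadcast to the executors.
    static Map<Integer, Integer> reduceByKey(List<int[]> pairs,
                                             Map<Integer, BinaryOperator<Integer>> combiners) {
        Map<Integer, Integer> reduced = new HashMap<>();
        for (int[] p : pairs) {
            // Merge each value into the running result for its key,
            // using that key's own combiner.
            reduced.merge(p[0], p[1], combiners.get(p[0]));
        }
        return reduced;
    }

    public static void main(String[] args) {
        Map<Integer, BinaryOperator<Integer>> combiners = new HashMap<>();
        combiners.put(0, Integer::sum); // key 0: sum values
        combiners.put(1, Math::max);    // key 1: keep the maximum
        List<int[]> pairs = Arrays.asList(
                new int[]{0, 3}, new int[]{0, 4}, new int[]{1, 5}, new int[]{1, 9});
        System.out.println(reduceByKey(pairs, combiners)); // {0=7, 1=9}
    }
}
```

Because each pair is folded in as it arrives, this mirrors what a combiner-capable operation like reduceByKey can do, rather than materializing whole groups the way groupByKey does.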
Re: ReduceByKey but with different functions depending on key
Map each (key, value) pair into (key, Tuple2<key, value>) and process that. Also, ask the Spark maintainers for variants of the keyed operations where the key is passed in as an argument; I run into these cases all the time.

    /**
     * Map each pair into a (key, (key, value)) pair to ensure subsequent
     * processing has access to both the key and the value.
     * @param inp input pair RDD
     * @param <K> key type
     * @param <V> value type
     * @return output RDD whose value holds both key and value
     */
    @Nonnull
    public static <K extends Serializable, V extends Serializable> JavaPairRDD<K, Tuple2<K, V>> toKeyedTuples(@Nonnull JavaPairRDD<K, V> inp) {
        return inp.flatMapToPair(new PairFlatMapFunction<Tuple2<K, V>, K, Tuple2<K, V>>() {
            @Override
            public Iterable<Tuple2<K, Tuple2<K, V>>> call(final Tuple2<K, V> t) throws Exception {
                return Collections.singletonList(
                        new Tuple2<K, Tuple2<K, V>>(t._1(), new Tuple2<K, V>(t._1(), t._2())));
            }
        });
    }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ReduceByKey-but-with-different-functions-depending-on-key-tp19177p19198.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
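Without Spark, the effect of carrying the key inside the value can be sketched with plain Java: after a toKeyedTuples-style transform, a single reduce function can inspect the embedded key and dispatch on it. The keys and operations here are hypothetical placeholders:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyedTuples {
    // Each value carries its own key, so one reduce function can see the
    // key and choose a different combining rule per key.
    static Map<Integer, Integer> reduce(List<SimpleEntry<Integer, Integer>> keyedValues) {
        Map<Integer, Integer> reduced = new HashMap<>();
        for (SimpleEntry<Integer, Integer> kv : keyedValues) {
            reduced.merge(kv.getKey(), kv.getValue(),
                    (a, b) -> kv.getKey() == 0 ? a + b : Math.max(a, b)); // sum for key 0, max otherwise
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Simulated output of a toKeyedTuples-style transform.
        List<SimpleEntry<Integer, Integer>> keyedValues = Arrays.asList(
                new SimpleEntry<>(0, 3), new SimpleEntry<>(0, 4),
                new SimpleEntry<>(1, 5), new SimpleEntry<>(1, 9));
        System.out.println(reduce(keyedValues)); // {0=7, 1=9}
    }
}
```

The trade-off of this approach is duplicating the key inside every value, in exchange for keeping the combiner-friendly reduceByKey path.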