Hi there,
My team created a class extending RDD, and in its getPartitions method we run a
parallelized job. We noticed that Spark hangs if we do a shuffle on an instance
of this RDD.
I’m wondering whether this is a valid use case, and whether the Spark team could
give us some suggestions.
We are using Spark 1.6.0 and here’s our code snippet:
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class testRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil)
  with Serializable {

  override def getPartitions: Array[Partition] = {
    // Launch a nested Spark job while the partitions are being computed;
    // this is the part that appears to trigger the hang.
    sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()

    // Build four partitions, each with its own index.
    val result = new Array[Partition](4)
    for (i <- 0 until 4) {
      result(i) = new Partition {
        override def index: Int = i
      }
    }
    result
  }

  override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
    Seq(("a", 3), ("b", 4)).iterator
}
val y = new testRDD(sc)
y.map(r => r).reduceByKey(_ + _).count()  // the shuffle here is where Spark hangs
Regards,
Yanyan Zhang