Hi, I tried to write small program which shows that using cache() can speed up execution but results with and without cache were similar. Could help me with this issue? I tried to compute rdd and use it later in two places and I thought in second usage this rdd is recomputed but it doesn't:
val help = sc.parallelize(Array.range(1, 20000)).repartition(100) .map(x => (scala.util.Random.nextInt(10), x)) val rdd = sc.parallelize(Array.range(1,20000)) .repartition(100) .map(x => (scala.util.Random.nextInt(10), x)) .join(help) .map { case (x, (n, i)) => (x, n)} .reduceByKey(_ + _) .cache() val rdd2 = sc.parallelize(Array.range(1,1000)).map(x => (x, x)) .join(rdd).saveAsTextFile("output/1") val rdd3 = sc.parallelize(Array.range(1,1000)).map(x => (scala.util.Random.nextInt(1000), x)) .join(rdd).saveAsTextFile("output/2") Thanks, Grzegorz