Hi,

I tried to write small program which shows that using cache() can speed up
execution but results with and without cache were similar. Could help me
with this issue? I tried to compute rdd and use it later in two places and
I thought in second usage this rdd is recomputed but it doesn't:

  val help = sc.parallelize(Array.range(1, 20000)).repartition(100)
    .map(x => (scala.util.Random.nextInt(10), x))
  val rdd = sc.parallelize(Array.range(1,20000))
    .repartition(100)
    .map(x => (scala.util.Random.nextInt(10), x))
    .join(help)
    .map { case (x, (n, i)) => (x, n)}
    .reduceByKey(_ + _)
    .cache()

  val rdd2 = sc.parallelize(Array.range(1,1000)).map(x => (x, x))
    .join(rdd).saveAsTextFile("output/1")
  val rdd3 = sc.parallelize(Array.range(1,1000)).map(x =>
(scala.util.Random.nextInt(1000), x))
    .join(rdd).saveAsTextFile("output/2")

Thanks,
Grzegorz

Reply via email to