Your rdd2 and rdd3 differ in two ways, so it's hard to isolate the exact
effect of caching. In rdd3, besides rdd already being cached by that point,
you are also doing a bunch of extra random number generation that rdd2
does not do.
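
One way to compare like with like is to run the same action twice on an RDD
that is not cached and twice on an identical one that is, so the only
difference between the runs is cache(). Below is a rough sketch of that, not
a proper benchmark. It assumes a local master; the names CacheTiming, timeMs
and expensive are made up for illustration; and it swaps your random keys for
a deterministic, deliberately slow map so that recomputation is visible.
Timings will be noisy, so run it a few times.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheTiming {
  // Hypothetical helper: wall-clock time of a block, in milliseconds.
  def timeMs[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000000
  }

  // Deterministic, deliberately slow stand-in for the real per-record work
  // (an assumption for illustration, replacing the Random.nextInt keys).
  def expensive(x: Int): (Int, Int) = {
    var acc = x
    var i = 0
    while (i < 10000) {
      acc = (acc * 31 + i) % 1000003
      i += 1
    }
    (x % 10, acc)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: running locally; drop setMaster when using spark-submit.
    val conf = new SparkConf().setAppName("CacheTiming").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Two RDDs built the same way; only the second one is marked for caching.
    val uncached = sc.parallelize(Array.range(1, 20000), 100).map(expensive)
    val cached = sc.parallelize(Array.range(1, 20000), 100).map(expensive).cache()

    // Same action, run twice on each RDD.
    val u1 = timeMs(uncached.count())
    val u2 = timeMs(uncached.count()) // recomputes the map
    val c1 = timeMs(cached.count())   // computes the map and fills the cache
    val c2 = timeMs(cached.count())   // should read from the cache

    println(s"uncached: first=$u1 ms, second=$u2 ms")
    println(s"cached:   first=$c1 ms, second=$c2 ms")

    sc.stop()
  }
}
```

If caching helps, the second run on the cached RDD should be noticeably
faster than the second run on the uncached one.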


On Wed, Aug 20, 2014 at 7:48 AM, Grzegorz Białek <
grzegorz.bia...@codilime.com> wrote:

> Hi,
>
> I tried to write a small program that shows that using cache() can speed
> up execution, but the results with and without cache were similar. Could
> you help me with this issue? I tried to compute an rdd and use it later in
> two places, and I thought that on the second use this rdd would be
> recomputed, but it isn't:
>
>   val help = sc.parallelize(Array.range(1, 20000)).repartition(100)
>     .map(x => (scala.util.Random.nextInt(10), x))
>   val rdd = sc.parallelize(Array.range(1,20000))
>     .repartition(100)
>     .map(x => (scala.util.Random.nextInt(10), x))
>     .join(help)
>     .map { case (x, (n, i)) => (x, n)}
>     .reduceByKey(_ + _)
>     .cache()
>
>   val rdd2 = sc.parallelize(Array.range(1,1000)).map(x => (x, x))
>     .join(rdd).saveAsTextFile("output/1")
>   val rdd3 = sc.parallelize(Array.range(1,1000))
>     .map(x => (scala.util.Random.nextInt(1000), x))
>     .join(rdd).saveAsTextFile("output/2")
>
> Thanks,
> Grzegorz
>
