I forgot to mention that I don't actually use all of my data. Instead I use a
sample extracted with randomSample.
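That detail is very likely the cause. An uncached RDD re-evaluates its whole lineage on every action, and if that lineage contains a random sampling step taken without a fixed seed, each action sees a *different* sample. So part1.count, part2.count, and pair.count are each computed over different draws of the data, and the counts need not agree. Caching materializes the sample once; passing a fixed seed to the sampling call has a similar effect. The sketch below mimics this in plain Scala (not Spark): a `def` that redraws the sample plays the role of an uncached lineage, a materialized `val` plays the role of a cached RDD. All names here are illustrative, not Spark API.

```scala
import scala.util.Random

// Hypothetical stand-in for the loaded dataset: (id, view) pairs.
val base: Seq[(Int, String)] =
  (1 to 200).map(i => (i, if (i % 2 == 0) "view1" else "view2"))

// One sampling pass over the data, roughly what a random sample does.
def sample(rng: Random): Seq[(Int, String)] =
  base.filter(_ => rng.nextDouble() < 0.5)

// "Cached" behaviour: the sample is materialized once, so part1 and
// part2 are filters over the *same* draw of the data.
val cachedSample = sample(new Random(42))
val part1 = cachedSample.filter(_._2 == "view1")
val part2 = cachedSample.filter(_._2 == "view2")
val pairCount = (for (a <- part1; b <- part2) yield (a, b)).size
// With a single materialized sample the invariant always holds:
assert(pairCount == part1.size * part2.size)

// "Uncached" behaviour: every reference re-runs the lineage, so the
// sample is redrawn with a fresh seed each time it is needed.
def uncachedPart1 = sample(new Random()).filter(_._2 == "view1")
def uncachedPart2 = sample(new Random()).filter(_._2 == "view2")
// The product of the counts and the size of the cartesian product are
// now computed over different draws, so they usually disagree:
println(uncachedPart1.size * uncachedPart2.size)
println((for (a <- uncachedPart1; b <- uncachedPart2) yield (a, b)).size)
```

In Spark terms: either `part1.cache()` / `part2.cache()` before counting, or take the sample with an explicit seed so the lineage is deterministic when recomputed.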


On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:

> Hi all,
>
> I notice that RDD.cartesian has a strange behavior with cached and
> uncached data. More precisely, I have a set of data that I load with
> objectFile
>
> *val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")*
>
> Then I split it into two sets depending on some criterion
>
>
> *val part1 = data.filter(_._2 matches "view1")*
> *val part2 = data.filter(_._2 matches "view2")*
>
>
> Finally, I compute the cartesian product of part1 and part2
>
> *val pair = part1.cartesian(part2)*
>
>
> If everything goes well I should have
>
> *pair.count == part1.count * part2.count*
>
> But this is not the case if I don't cache part1 and part2.
>
> What was I missing? Is caching data mandatory in Spark?
>
> Cheers,
>
> Jaonary
>
