I forgot to mention that I don't use all of my data; instead, I work on a sample extracted with randomSample.
On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Hi all,
>
> I notice that RDD.cartesian has a strange behavior with cached and
> uncached data. More precisely, I have a set of data that I load with
> objectFile:
>
> val data: RDD[(Int, String, Array[Double])] = sc.objectFile("data")
>
> Then I split it into two sets depending on some criteria:
>
> val part1 = data.filter(_._2 matches "view1")
> val part2 = data.filter(_._2 matches "view2")
>
> Finally, I compute the cartesian product of part1 and part2:
>
> val pair = part1.cartesian(part2)
>
> If everything goes well, I should have
>
> pair.count == part1.count * part2.count
>
> but this is not the case if I don't cache part1 and part2.
>
> What was I missing? Is caching data mandatory in Spark?
>
> Cheers,
>
> Jaonary
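The randomSample detail is likely the key: an uncached RDD re-runs its whole lineage, including the random sampling, every time an action touches it, so part1 and part2 can be computed from two *different* samples, and `pair.count` no longer matches `part1.count * part2.count`. Calling `part1.cache()` and `part2.cache()` (and materializing them with a count) pins down one sample. The snippet below is a plain-Scala sketch of that effect, with no Spark dependency: a `def` plays the role of an uncached RDD (re-evaluated on every use), a `val` plays the role of a cached one. The names (`uncachedSample`, `cachedSample`, etc.) are illustrative, not from the original post.

```scala
import scala.util.Random

val data = (1 to 1000).toList
val rng  = new Random()

// "Uncached": a def re-runs its body on every reference, like an
// uncached RDD recomputing its lineage -- including the random
// sampling -- each time an action is invoked on it.
def uncachedSample = data.filter(_ => rng.nextDouble() < 0.5)

val firstUse  = uncachedSample
val secondUse = uncachedSample
// firstUse and secondUse almost surely contain different elements,
// even though they came from "the same" definition.

// "Cached": the sample is materialized once; every later use sees
// the same snapshot, as with RDD.cache() followed by count().
val cachedSample = data.filter(_ => rng.nextDouble() < 0.5)
val part1 = cachedSample.filter(_ % 2 == 0)
val part2 = cachedSample.filter(_ % 2 == 1)
val pair  = for (a <- part1; b <- part2) yield (a, b)

// With a consistent snapshot, the cartesian identity holds again.
println(pair.size == part1.size * part2.size)
```

So caching is not mandatory in Spark in general; it only becomes necessary here because the lineage contains a non-deterministic step, so each recomputation yields a different dataset.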