Re: Strange behavior of RDD.cartesian
You can find a gist that illustrates this issue here:
https://gist.github.com/jrabary/9953562

I got this with Spark built from the master branch.

On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash wrote:
> Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a
> hash collision bug, fixed in 0.9.1, that might cause you to have too few
> results in that join.
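(The gist itself is not inlined above. A minimal sketch of the kind of program the thread describes follows; the object name, input path, sample fraction, and seed are placeholders, not taken from the gist.)

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object CartesianRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cartesian-repro"))

        // Placeholder path; the data is (Int, String, Array[Double]) triples.
        val data: RDD[(Int, String, Array[Double])] = sc.objectFile("data")

        // Sample without caching, as described later in the thread.
        val sampled = data.sample(false, 0.1, 42)

        val part1 = sampled.filter(_._2 matches "view1")
        val part2 = sampled.filter(_._2 matches "view2")
        val pair  = part1.cartesian(part2)

        // Each count below is a separate action, so the uncached lineage
        // (including the sample) is re-evaluated for each one; the thread
        // reports that the two sides can then disagree.
        println("pair.count                = " + pair.count())
        println("part1.count * part2.count = " + part1.count() * part2.count())
      }
    }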
Re: Strange behavior of RDD.cartesian
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a
hash collision bug, fixed in 0.9.1, that might cause you to have too few
results in that join.

On Mar 28, 2014 8:04 PM, "Matei Zaharia" wrote:
> Weird, how exactly are you pulling out the sample? Do you have a small
> program that reproduces this?
>
> Matei
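(A sketch of one way to apply the workaround suggested above on Spark 0.9.x; the app name is a placeholder. The property can also be passed on the command line as -Dspark.shuffle.spill=false.)

    import org.apache.spark.{SparkConf, SparkContext}

    // Disable shuffle spilling before creating the context, as suggested
    // above to sidestep the 0.9.0 hash collision bug.
    val conf = new SparkConf()
      .setAppName("cartesian-repro") // placeholder
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)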
Re: Strange behavior of RDD.cartesian
Weird, how exactly are you pulling out the sample? Do you have a small
program that reproduces this?

Matei

On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa wrote:
> I forgot to mention that I don't really use all of my data. Instead I use
> a sample extracted with randomSample.
Re: Strange behavior of RDD.cartesian
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.

On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa wrote:
> Hi all,
>
> I notice that RDD.cartesian behaves strangely with cached versus uncached
> data. More precisely, I have a set of data that I load with objectFile:
>
>     val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")
>
> Then I split it into two sets depending on some criteria:
>
>     val part1 = data.filter(_._2 matches "view1")
>     val part2 = data.filter(_._2 matches "view2")
>
> Finally, I compute the cartesian product of part1 and part2:
>
>     val pair = part1.cartesian(part2)
>
> If everything goes well, I should have
>
>     pair.count == part1.count * part2.count
>
> But this is not the case if I don't cache part1 and part2.
>
> What am I missing? Is caching data mandatory in Spark?
>
> Cheers,
>
> Jaonary
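(If the counts diverge because the uncached part1 and part2 are recomputed independently for each action, a sketch of the caching approach mentioned above would look as follows; the path, fraction, and seed are placeholders, and sc is assumed to be an existing SparkContext.)

    import org.apache.spark.rdd.RDD

    val data: RDD[(Int, String, Array[Double])] = sc.objectFile("data")

    // Persist the sample and the filtered parts so that every action sees
    // the same data instead of re-running the lineage from scratch.
    val sampled = data.sample(false, 0.1, 42).cache()

    val part1 = sampled.filter(_._2 matches "view1").cache()
    val part2 = sampled.filter(_._2 matches "view2").cache()

    val pair = part1.cartesian(part2)
    assert(pair.count() == part1.count() * part2.count())

Caching is not mandatory in Spark in general, but persisting an intermediate RDD whose contents may vary across recomputations (such as a random sample) is the usual way to get consistent results from multiple actions.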