Re: Strange behavior of RDD.cartesian
You can find a gist that illustrates this issue here: https://gist.github.com/jrabary/9953562

I got this with Spark built from the master branch.

On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash and...@andrewash.com wrote:

Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash collision bug, fixed in 0.9.1, that might cause you to have too few results in that join.

On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this?

Matei

On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.

On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

Hi all,

I notice that RDD.cartesian behaves strangely with cached and uncached data. More precisely, I have a set of data that I load with objectFile:

    val data: RDD[(Int,String,Array[Double])] = sc.objectFile(data)

Then I split it into two sets depending on some criteria:

    val part1 = data.filter(_._2 matches view1)
    val part2 = data.filter(_._2 matches view2)

Finally, I compute the cartesian product of part1 and part2:

    val pair = part1.cartesian(part2)

If everything goes well I should have

    pair.count == part1.count * part2.count

But this is not the case if I don't cache part1 and part2. What am I missing? Is caching data mandatory in Spark?

Cheers,
Jaonary
Re: Strange behavior of RDD.cartesian
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash collision bug, fixed in 0.9.1, that might cause you to have too few results in that join.
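For reference, the spark.shuffle.spill setting suggested above can be applied on the SparkConf before the context is created. This is only a minimal sketch, assuming a Spark 0.9.x-style SparkConf; the object name, application name and local master are hypothetical placeholders and do not come from the thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleSpillOff {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cartesian-check")        // hypothetical application name
      .setMaster("local[2]")                // hypothetical master, for a local test run
      .set("spark.shuffle.spill", "false")  // the setting suggested in the thread

    val sc = new SparkContext(conf)
    try {
      // Print the value stored in the configuration the context was built from.
      println(conf.get("spark.shuffle.spill"))
    } finally {
      sc.stop()
    }
  }
}
```

The same key can usually also be passed as a Java system property (-Dspark.shuffle.spill=false) when launching the driver, if editing the program is not convenient.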
Re: Strange behavior of RDD.cartesian
I forgot to mention that I don't really use all of my data. Instead I use a sample extracted with randomSample.
Re: Strange behavior of RDD.cartesian
Weird, how exactly are you pulling out the sample? Do you have a small program that reproduces this?

Matei
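A small program of the kind asked for here might look like the sketch below. It is not the poster's code (that is in the gist linked in the thread) and it is not claimed to reproduce the problem: the synthetic data, the "view1"/"view2" patterns, the sampling fraction and seed, and the position of the sampling step before the filters are all assumptions made for illustration. It only runs the count comparison described in the original message, with and without cache().

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CartesianCountCheck {
  // Returns (actual, expected): the size of the cartesian product and
  // the product of the two input sizes, which the poster expects to match.
  def check(sc: SparkContext, useCache: Boolean): (Long, Long) = {
    // Synthetic stand-in for the data loaded with objectFile (hypothetical).
    val data: RDD[(Int, String, Array[Double])] = sc.parallelize(
      (0 until 1000).map { i =>
        (i, if (i % 2 == 0) "view1" else "view2", Array(i.toDouble))
      })

    // Sampling step; where it sits in the real job is an assumption.
    val sampled = data.sample(false, 0.5, 42)

    // Split into two sets, as in the original message.
    val p1 = sampled.filter(_._2 matches "view1")
    val p2 = sampled.filter(_._2 matches "view2")
    val (part1, part2) = if (useCache) (p1.cache(), p2.cache()) else (p1, p2)

    val pair = part1.cartesian(part2)
    (pair.count(), part1.count() * part2.count())
  }
}
```

Calling CartesianCountCheck.check(sc, useCache = false) and then check(sc, useCache = true) on an existing SparkContext and comparing the two returned pairs would show whether the counts diverge on a given Spark build.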