Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
Here is a gist that illustrates the issue:
https://gist.github.com/jrabary/9953562
I see this with Spark built from the master branch.


On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash and...@andrewash.com wrote:

 Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a
 hash collision bug, fixed in 0.9.1, that might cause you to have too
 few results in that join.

 Sent from my mobile phone
 On Mar 28, 2014 8:04 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

 Weird, how exactly are you pulling out the sample? Do you have a small
 program that reproduces this?

 Matei

 On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 I forgot to mention that I don't really use all of my data. Instead I use
 a sample extracted with randomSample.


 On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Hi all,

 I notice that RDD.cartesian behaves differently on cached and uncached
 data. More precisely, I have a data set that I load with objectFile:

 val data: RDD[(Int,String,Array[Double])] = sc.objectFile(data)

 Then I split it into two sets according to some criteria:


 val part1 = data.filter(_._2 matches view1)
 val part2 = data.filter(_._2 matches view2)


 Finally, I compute the cartesian product of part1 and part2:

 val pair = part1.cartesian(part2)


 If everything goes well, I should have

 pair.count == part1.count * part2.count

 But this is not the case if I don't cache part1 and part2.

 What am I missing? Is caching data mandatory in Spark?
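
For reference, the check described above can be written as one small program. This is only a sketch and not the code from the gist: the toy records, the regular expressions "view1.*" and "view2.*", the object name and the local master are all assumptions made for illustration, and parallelize stands in for objectFile. It caches both parts, which is the case reported to give the expected count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CartesianCountCheck {
  // Returns (part1.count, part2.count, pair.count) for a small in-memory data set.
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf()
      .setAppName("cartesian-count-check") // placeholder name
      .setMaster("local[2]")               // placeholder master
    val sc = new SparkContext(conf)
    try {
      // Toy records standing in for the data loaded with objectFile.
      val data = sc.parallelize(Seq(
        (1, "view1_a", Array(0.1)),
        (2, "view1_b", Array(0.2)),
        (3, "view2_a", Array(0.3)),
        (4, "view2_b", Array(0.4)),
        (5, "view2_c", Array(0.5))
      ))

      // Hypothetical regexes; the thread does not show view1 and view2.
      val part1 = data.filter(_._2 matches "view1.*").cache()
      val part2 = data.filter(_._2 matches "view2.*").cache()

      val pair = part1.cartesian(part2)
      (part1.count(), part2.count(), pair.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (n1, n2, nPair) = run()
    println("part1=" + n1 + " part2=" + n2 + " pair=" + nPair)
    // The invariant from the post.
    assert(nPair == n1 * n2)
  }
}
```

Removing the two cache() calls gives the uncached variant to compare against on the real data.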

 Cheers,

 Jaonary








Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug, fixed in 0.9.1, that might cause you to have too few
results in that join.
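
One way to apply that setting from application code is through SparkConf, sketched below. The object name, app name and local master are placeholders, not anything taken from the thread, and whether the setting has any effect depends on the Spark version in use.

```scala
import org.apache.spark.SparkConf

object ShuffleSpillOff {
  // Builds a SparkConf with shuffle spilling disabled.
  // The app name and local master are placeholders for illustration only.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("cartesian-check")
      .setMaster("local[2]")
      .set("spark.shuffle.spill", "false")

  def main(args: Array[String]): Unit = {
    val conf = build()
    // Read the value back to confirm it is set before creating a SparkContext from it.
    println("spark.shuffle.spill = " + conf.get("spark.shuffle.spill"))
  }
}
```

The resulting conf would then be passed to `new SparkContext(conf)` before any RDD is created.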









Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.
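
The thread does not show the sampling call, and randomSample is not an RDD method I am aware of. Assuming it stands for something like RDD.sample, a minimal sketch of taking a sample once and reusing it could look like this. The fraction, the seed, the toy data and the object name are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleOnce {
  // Returns the size of the same cached sample, counted twice.
  def run(): (Long, Long) = {
    val conf = new SparkConf()
      .setAppName("sample-once") // placeholder name
      .setMaster("local[2]")     // placeholder master
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000)

      // Sample without replacement using a fixed seed, then cache the result
      // so later actions reuse the same elements instead of recomputing them.
      val sampled = data.sample(false, 0.1, 42).cache()

      (sampled.count(), sampled.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = run()
    println("first=" + first + " second=" + second)
    assert(first == second)
  }
}
```

Whether sampling has anything to do with the mismatch reported in this thread is not established here; the sketch only shows how to rule out a sample that changes between actions.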








Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird, how exactly are you pulling out the sample? Do you have a small program 
that reproduces this?

Matei
