Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
Here is a gist that illustrates the issue:
https://gist.github.com/jrabary/9953562
I get this with Spark built from the master branch.




Re: Strange behavior of RDD.cartesian

2014-03-29 Thread Andrew Ash
Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a hash
collision bug, fixed in 0.9.1, that might cause you to get too few results
in that join.
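
A minimal sketch of how the setting suggested above could be applied in a standalone program, assuming Spark 0.9.x and its SparkConf API. The master URL, the application name, and the object name are placeholders and do not come from this thread.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: Spark 0.9.x is on the classpath.
// "local", "cartesian-check" and ShuffleSpillOff are placeholder names.
object ShuffleSpillOff {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("cartesian-check")
      // Turn off spilling to disk during shuffles, as suggested in the message.
      .set("spark.shuffle.spill", "false")

    val sc = new SparkContext(conf)
    try {
      // Run the job that showed the problem here.
    } finally {
      sc.stop()
    }
  }
}
```

If the counts come out right with this setting, that would point toward the 0.9.0 spill bug rather than toward caching itself.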



Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Matei Zaharia
Weird. How exactly are you pulling out the sample? Do you have a small program
that reproduces this?

Matei




Re: Strange behavior of RDD.cartesian

2014-03-28 Thread Jaonary Rabarisoa
I forgot to mention that I don't really use all of my data. Instead I use a
sample extracted with randomSample.


On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa wrote:

> Hi all,
>
> I noticed that RDD.cartesian behaves differently with cached and
> uncached data. More precisely, I have a data set that I load with
> objectFile:
>
> val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")
>
> Then I split it into two sets depending on some criteria:
>
> val part1 = data.filter(_._2 matches "view1")
> val part2 = data.filter(_._2 matches "view2")
>
>
> Finally, I compute the Cartesian product of part1 and part2:
>
> val pair = part1.cartesian(part2)
>
>
> If everything goes well, I should have
>
> pair.count == part1.count * part2.count
>
> But this is not the case if I don't cache part1 and part2.
>
> What am I missing? Is caching data mandatory in Spark?
>
> Cheers,
>
> Jaonary
>
>
>
>
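
The check described in the original message can be sketched as a small standalone program. This is only an illustration under stated assumptions, not the content of the gist. The in-memory records stand in for sc.objectFile("data"), and the names CartesianCheck, Record and counts are hypothetical. It assumes Spark is on the classpath and runs in local mode.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object; the names are illustrative, not from the thread.
object CartesianCheck {
  type Record = (Int, String, Array[Double])

  // Returns (actual pair count, expected pair count) for the two views,
  // optionally caching both parts before taking the Cartesian product.
  def counts(data: RDD[Record], cache: Boolean): (Long, Long) = {
    val part1 = data.filter(_._2 matches "view1")
    val part2 = data.filter(_._2 matches "view2")
    if (cache) {
      part1.cache()
      part2.cache()
    }
    val pair = part1.cartesian(part2)
    (pair.count(), part1.count() * part2.count())
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "cartesian-check")
    try {
      // Assumption: small made-up records replacing sc.objectFile("data").
      val data: RDD[Record] = sc.parallelize(Seq(
        (1, "view1", Array(0.1, 0.2)),
        (2, "view1", Array(0.3, 0.4)),
        (3, "view2", Array(0.5, 0.6)),
        (4, "view2", Array(0.7, 0.8)),
        (5, "view2", Array(0.9, 1.0))
      ))
      // Compare the two counts with and without caching.
      for (cache <- Seq(false, true)) {
        val (actual, expected) = counts(data, cache)
        println(s"cache=$cache actual=$actual expected=$expected")
      }
    } finally {
      sc.stop()
    }
  }
}
```

To get closer to the reported setup, replace the parallelize call with the objectFile load and the sampling step, then compare the two counts in both modes.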