Here is a gist that illustrates the issue: https://gist.github.com/jrabary/9953562. I get this with Spark built from the master branch.

On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash <and...@andrewash.com> wrote:

> Is this Spark 0.9.0? Try setting spark.shuffle.spill=false. There was a
> hash collision bug, fixed in 0.9.1, that might cause you to get too
> few results in that join.
>
> Sent from my mobile phone
>
> On Mar 28, 2014 8:04 PM, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
>
>> Weird, how exactly are you pulling out the sample? Do you have a small
>> program that reproduces this?
>>
>> Matei
>>
>> On Mar 28, 2014, at 3:09 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>
>> I forgot to mention that I don't really use all of my data. Instead, I
>> use a sample extracted with randomSample.
>>
>> On Fri, Mar 28, 2014 at 10:58 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I noticed that RDD.cartesian behaves strangely with cached versus
>>> uncached data. More precisely, I have a set of data that I load with
>>> objectFile:
>>>
>>> val data: RDD[(Int,String,Array[Double])] = sc.objectFile("data")
>>>
>>> Then I split it into two sets depending on some criteria:
>>>
>>> val part1 = data.filter(_._2 matches "view1")
>>> val part2 = data.filter(_._2 matches "view2")
>>>
>>> Finally, I compute the cartesian product of part1 and part2:
>>>
>>> val pair = part1.cartesian(part2)
>>>
>>> If everything goes well, I should have
>>>
>>> pair.count == part1.count * part2.count
>>>
>>> But this is not the case if I don't cache part1 and part2.
>>>
>>> What am I missing? Is caching data mandatory in Spark?
>>>
>>> Cheers,
>>>
>>> Jaonary
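
For anyone hitting the same thing, here is a minimal, self-contained sketch of the mechanism the thread is circling around. It is not the code from the gist; the object name, the synthetic data, and the mapPartitions-based sampling are illustrative assumptions, and it assumes local mode with the Scala API. The point: without cache(), every action recomputes the sampled RDD from scratch, so any non-deterministic sampling (an unseeded random filter, or a sample taken without a fixed seed) produces a different subset on each recomputation, and pair.count, part1.count, and part2.count end up drawn from different samples.

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object UncachedSampleRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cartesian-repro").setMaster("local[2]"))

    val data = sc.parallelize(1 to 100000, 4)

    // Unseeded randomness inside a transformation: each recomputation
    // keeps a different ~10% subset of the data.
    val sampled = data.mapPartitions { it =>
      val rng = new Random() // no fixed seed => non-deterministic
      it.filter(_ => rng.nextDouble() < 0.1)
    }

    val part1 = sampled.filter(_ % 2 == 0)
    val part2 = sampled.filter(_ % 2 == 1)
    val pair  = part1.cartesian(part2)

    // Three separate actions => three independent recomputations of
    // `sampled`, so the invariant usually fails here.
    println(s"pair.count = ${pair.count}")
    println(s"product    = ${part1.count * part2.count}")

    // Caching pins down one materialized sample and restores the
    // invariant (using a fixed seed in the sampling step would too).
    sampled.cache()
    sampled.count() // force materialization
    println(s"cached: pair = ${pair.count}, " +
      s"product = ${part1.count * part2.count}")

    sc.stop()
  }
}

So caching is not mandatory in Spark, but either caching or a deterministic (fixed-seed) sample is needed if the same sampled RDD must be consistent across multiple actions.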