Hi, yes, I have joined using DataFrame join. Now, to save this into HDFS, I
am converting the joined DataFrame to an RDD (dataframe.rdd) and saving it
with saveAsTextFile. However, this is also taking too much time.
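
A sketch of what writing through the DataFrame API directly could look like, assuming Spark 1.6 and a joined DataFrame named `joinedDf` (the output path is a placeholder). This avoids the extra DataFrame-to-RDD[Row] conversion, and a columnar format such as Parquet is usually much faster to write than text:

```scala
// Sketch, not the poster's actual code: `joinedDf` and the HDFS path are
// placeholders. Writing via DataFrameWriter keeps the optimized plan and
// skips the dataframe.rdd conversion step.
joinedDf.write
  .mode("overwrite")
  .parquet("hdfs:///user/output/joined")

// If plain text output is a hard requirement, dataframe.rdd.saveAsTextFile
// still works, but the format conversion itself is rarely the bottleneck.
```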

Thanks,
Padma Ch

On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com>
wrote:

> Hi,
>
> It seems you'd be better off using DataFrame#join instead of RDD.cartesian,
> because RDD.cartesian always needs shuffle operations, which carry a lot of
> overhead (reflection, serialization, and so on).
> In your case, since the smaller table is only 7 MB, DataFrame#join can use a
> broadcast strategy.
> This is considerably more efficient than RDD.cartesian.
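
The idea behind the broadcast strategy described above can be sketched in plain Scala (a toy illustration, not Spark's implementation): the small side is materialized as an in-memory hash map and the large side is streamed once, so the cost is proportional to the large side alone rather than to the product of both sides.

```scala
// Toy model of a broadcast hash join. `Code` and `Tx` are hypothetical
// row types invented for this example.
case class Code(id: Int, name: String)
case class Tx(id: Int, amount: Double)

def broadcastJoin(large: Seq[Tx], small: Seq[Code]): Seq[(Tx, Code)] = {
  // "Broadcast": build a lookup map from the small side once.
  val lookup: Map[Int, Code] = small.map(c => c.id -> c).toMap
  // Stream the large side, probing the map for each row.
  large.flatMap(t => lookup.get(t.id).map(c => (t, c)))
}
```

In Spark itself this choice is driven by `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), which is why a 7 MB table qualifies.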
>
> // maropu
>
> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> It is basically a Cartesian join, as in an RDBMS.
>>
>> Example:
>>
>> SELECT * FROM FinancialCodes,  FinancialData
>>
>> The result of this query matches every row in the FinancialCodes table
>> with every row in the FinancialData table. Each row consists of all
>> columns from the FinancialCodes table followed by all columns from the
>> FinancialData table.
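
The size blow-up described above is easy to see in miniature (a toy sketch; `codes` and `data` are invented stand-ins for rows of the two tables): a Cartesian product pairs every row with every row, so the output has |A| × |B| rows, which is why it explodes even for modest inputs.

```scala
// Stand-ins for rows of FinancialCodes and FinancialData.
val codes = Seq("FX", "EQ", "FI")
val data  = Seq(100, 200)

// Cartesian product: every code paired with every data row.
val cart = for (c <- codes; d <- data) yield (c, d)
// cart has codes.size * data.size = 3 * 2 = 6 rows.
```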
>>
>>
>> Not very useful
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB.
>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>>> cartesian operation?
>>>
>>> I am using Spark version 1.6.0.
>>>
>>> Regards,
>>> Padma Ch
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
