I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time.
Thanks,
Padma Ch

On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Hi,
>>
>> Yes, I have joined using DataFrame join. Now, to save this into HDFS, I am
>> converting the joined DataFrame to an RDD (dataframe.rdd) and trying to
>> save it with saveAsTextFile. However, this is also taking too much time.
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It seems you'd be better off using DataFrame#join instead of
>>> RDD.cartesian, because the latter always needs shuffle operations,
>>> which carry a lot of overhead such as reflection, serialization, ...
>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>> broadcast strategy. This is more efficient than RDD.cartesian.
>>>
>>> // maropu
>>>
>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>
>>>> Example:
>>>>
>>>> SELECT * FROM FinancialCodes, FinancialData
>>>>
>>>> The result of this query matches every row in the FinancialCodes table
>>>> with every row in the FinancialData table. Each row consists of all
>>>> columns from the FinancialCodes table followed by all columns from the
>>>> FinancialData table.
>>>>
>>>> Not very useful.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have two RDDs, A and B, where A is of size 30 MB and B is of size
>>>>> 7 MB. A.cartesian(B) is taking too much time. Is there any bottleneck
>>>>> in the cartesian operation?
>>>>>
>>>>> I am using Spark version 1.6.0.
>>>>>
>>>>> Regards,
>>>>> Padma Ch
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>
> --
> ---
> Takeshi Yamamuro
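[Editor's note] A back-of-the-envelope estimate shows why A.cartesian(B) is slow even though both inputs are small: the output row count is the product of the two input row counts, so the output grows quadratically. The average row size below is an assumption purely for illustration, not something stated in the thread:

```python
# Rough size estimate for the cartesian product of a 30 MB RDD and a
# 7 MB RDD. The 100-byte average row size is an assumption; the
# quadratic blow-up is the point, not the exact numbers.
MB = 1024 * 1024
row_bytes = 100                      # assumed average row size

rows_a = (30 * MB) // row_bytes      # ~315k rows in the 30 MB RDD
rows_b = (7 * MB) // row_bytes       # ~73k rows in the 7 MB RDD

output_rows = rows_a * rows_b
# Each output pair carries one row from A and one from B.
output_bytes = output_rows * 2 * row_bytes

print(f"{output_rows:,} output rows, ~{output_bytes / 1024**4:.1f} TiB")
# → 23,089,584,800 output rows, ~4.2 TiB
```

Under these assumptions the cartesian product materializes tens of billions of pairs and terabytes of output, which is why it dominates the runtime regardless of how the result is later saved.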
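[Editor's note] The broadcast strategy Takeshi mentions is essentially a hash join in which the small side is shipped whole to every worker and indexed by join key, so the large side is streamed past it with no shuffle. A minimal, Spark-free sketch in plain Python; the table contents and key names here are invented for illustration only:

```python
# Sketch of a broadcast hash join: build a hash map from the small
# table once, then probe it with one pass over the large table.
# This mirrors what a broadcast join does on each executor.

small_table = [           # stands in for the ~7 MB side (illustrative data)
    ("AAPL", "Technology"),
    ("XOM", "Energy"),
]
large_table = [           # stands in for the ~30 MB side (illustrative data)
    ("AAPL", 150.0),
    ("AAPL", 151.2),
    ("XOM", 61.5),
    ("MSFT", 300.1),      # no match in the small table; dropped by inner join
]

# 1. "Broadcast": index the small table by join key.
lookup = {key: value for key, value in small_table}

# 2. Probe: stream the large table against the in-memory map.
joined = [
    (key, price, lookup[key])
    for key, price in large_table
    if key in lookup
]

print(joined)
# → [('AAPL', 150.0, 'Technology'), ('AAPL', 151.2, 'Technology'), ('XOM', 61.5, 'Energy')]
```

The cost is one pass over each input plus hash lookups, rather than the |A| × |B| pair enumeration a cartesian product performs, which is why the broadcast join wins so decisively when one side fits in memory.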