What is the use case of this? A Cartesian product is by definition slow in any 
system. Why do you need it? How long does your application take now?
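As a back-of-the-envelope illustration (plain Python, nothing Spark-specific) of why any cartesian product is inherently expensive: the output row count is always the product of the input row counts, so even modest inputs explode.

```python
from itertools import product

# A cartesian product always materialises len(a) * len(b) rows,
# regardless of how the engine computes it.
a = list(range(1_000))
b = list(range(1_000))
pairs = sum(1 for _ in product(a, b))
print(pairs)  # 1_000_000 rows from two 1,000-row inputs

# The 30 MB x 7 MB case in this thread scales the same way:
# row counts multiply, so the output can dwarf both inputs combined.
```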

> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
> 
> I tried 
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even 
> this is taking too much time.
> 
> Thanks,
> Padma Ch
> 
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com> 
>> wrote:
>> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as 
>> parquet, orc, ...?
>> 
>> // maropu
>> 
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com> 
>>> wrote:
>>> Hi, yes, I have joined using DataFrame join. Now, to save this into HDFS, I 
>>> am converting the joined dataframe to an RDD (dataframe.rdd) and using 
>>> saveAsTextFile to save it. However, this is also taking too much time.
>>> 
>>> Thanks,
>>> Padma Ch
>>> 
>>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com> 
>>>> wrote:
>>>> Hi, 
>>>> 
>>>> Seems you'd be better off using DataFrame#join instead of RDD.cartesian, 
>>>> because RDD.cartesian always needs shuffle operations, which have a lot of 
>>>> overhead (reflection, serialization, ...).
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a 
>>>> broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
>>>> 
>>>> // maropu
>>>> 
>>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh 
>>>>> <mich.talebza...@gmail.com> wrote:
>>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>> 
>>>>> Example:
>>>>> 
>>>>> SELECT * FROM FinancialCodes,  FinancialData
>>>>> 
>>>>> This query matches every row in the FinancialCodes table with every row 
>>>>> in the FinancialData table. Each result row consists of all columns from 
>>>>> the FinancialCodes table followed by all columns from the FinancialData 
>>>>> table.
>>>>> 
>>>>> Not very useful 
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn  
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>  
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>>   I have two RDDs, A and B, where A is 30 MB and B is 7 MB. 
>>>>>> A.cartesian(B) is taking too much time. Is there any bottleneck in the 
>>>>>> cartesian operation?
>>>>>> 
>>>>>> I am using Spark version 1.6.0.
>>>>>> 
>>>>>> Regards,
>>>>>> Padma Ch
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> ---
>>>> Takeshi Yamamuro
>> 
> 
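For what it's worth, the reason the broadcast strategy mentioned above wins is that the small side is turned into an in-memory hash table and shipped to each worker, so the large side is scanned once with cheap hash probes instead of being paired with every small-side row. A pure-Python sketch of that idea (not Spark's actual implementation; the table contents below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical tables: (code, description) and (code, amount).
financial_codes = [("A1", "Equity"), ("B2", "Bond"), ("C3", "Cash")]
financial_data = [("A1", 100.0), ("B2", 250.0), ("A1", 75.0), ("Z9", 5.0)]

# "Broadcast" side: hash the small table once by the join key.
by_code = defaultdict(list)
for code, desc in financial_codes:
    by_code[code].append(desc)

# Stream the large side once, probing the hash table per row
# instead of pairing each row with every small-side row.
joined = [
    (code, desc, amount)
    for code, amount in financial_data
    for desc in by_code.get(code, [])
]
print(joined)
# [('A1', 'Equity', 100.0), ('B2', 'Bond', 250.0), ('A1', 'Equity', 75.0)]
```

A cartesian product, by contrast, has no key to probe on, so every row must meet every other row; that is why no engine can make it fast.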