Let's say I have an RDD A of strings {"hi","bye","ch"} and another RDD B of strings {"padma","hihi","chch","priya"}. For every string in RDD A I need to check for matches in RDD B; for example, for the string "hi" I have to check against all strings in RDD B, which means I need to generate every possible combination. Hence I am generating the cartesian product and then, using a map transformation on the cartesian RDD, checking for matches.
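The all-pairs check described above can be sketched on plain Scala collections (local stand-ins for the two RDDs; the substring-containment test is an assumption, since the exact match criterion isn't shown in the thread):

```scala
object CartesianMatch {
  def main(args: Array[String]): Unit = {
    val a = Seq("hi", "bye", "ch")                    // stand-in for RDD A
    val b = Seq("padma", "hihi", "chch", "priya")     // stand-in for RDD B

    // Equivalent of A.cartesian(B).filter { case (x, y) => y.contains(x) }:
    // every element of A is compared against every element of B,
    // so the work grows as |A| * |B|.
    val matches = for {
      x <- a
      y <- b
      if y.contains(x)                                // assumed match criterion
    } yield (x, y)

    println(matches)                                  // List((hi,hihi), (ch,chch))
  }
}
```

On real RDDs the same shape, `A.cartesian(B)` followed by a filter, materialises every pair across the cluster, which is why it gets slow as the inputs grow.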
Is there any better way I could do this other than performing a cartesian? The application has taken 30 mins so far, and on top of that I see executor-lost issues.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:22 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> What is the use case of this? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does your application take
> now?
>
> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path").
> Even this is taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
>> parquet, orc, ...?
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>>
>>> Hi, yes, I have joined using a DataFrame join. Now, to save this into
>>> HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd)
>>> and trying to save it with saveAsTextFile. However, this is also taking
>>> too much time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems you'd be better off using DataFrame#join instead of
>>>> RDD.cartesian, because cartesian always needs shuffle operations,
>>>> which have a lot of overhead such as reflection, serialization, ...
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>>> broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
>>>>
>>>> // maropu
>>>>
>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>>
>>>>> Example:
>>>>>
>>>>> SELECT * FROM FinancialCodes, FinancialData
>>>>>
>>>>> The result of this query matches every row in the FinancialCodes
>>>>> table with every row in the FinancialData table. Each row consists
>>>>> of all columns from the FinancialCodes table followed by all columns
>>>>> from the FinancialData table.
>>>>>
>>>>> Not very useful.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn:
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have two RDDs A and B, where A is of size 30 MB and B is of
>>>>>> size 7 MB. A.cartesian(B) is taking too much time. Is there any
>>>>>> bottleneck in the cartesian operation?
>>>>>>
>>>>>> I am using Spark version 1.6.0.
>>>>>>
>>>>>> Regards,
>>>>>> Padma Ch
>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
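The broadcast strategy suggested in the thread can be illustrated without a cluster: collect the small side into local memory and scan the large side once, instead of materialising all |A| × |B| pairs and shuffling them. A minimal local sketch (plain Scala collections stand in for the RDDs, and the containment test is again an assumed match criterion):

```scala
object BroadcastStyleMatch {
  def main(args: Array[String]): Unit = {
    val small = Seq("hi", "bye", "ch")                 // the 7 MB side, small enough to broadcast
    val large = Seq("padma", "hihi", "chch", "priya")  // the 30 MB side

    // In Spark this would be sc.broadcast(smallRdd.collect()) and a single
    // map over the large RDD; here a local val plays the broadcast role.
    val broadcastSmall = small

    // One pass over the large side; each record is checked against the
    // broadcast copy, so there is no shuffle and no cartesian materialisation.
    val matches = large.flatMap { y =>
      broadcastSmall.filter(x => y.contains(x)).map(x => (x, y))
    }

    println(matches)                                   // List((hi,hihi), (ch,chch))
  }
}
```

This produces the same pairs as the cartesian-plus-filter version, but the cost is one scan of the large side times the (small) broadcast set, which is essentially what Spark's broadcast join does under the hood when one side fits in memory.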