Let's say I have an RDD A of strings {"hi","bye","ch"} and another RDD B of strings {"padma","hihi","chch","priya"}. For every string in RDD A I need to check for matches in RDD B; for example, for the string "hi" I have to check against all strings in RDD B, which means I need to generate every possible combination. Hence I am generating the cartesian product and then, using a map transformation on the cartesian RDD, checking for matches.
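The all-pairs check described above can be sketched on plain Scala collections (local stand-ins for the two RDDs; the substring-containment test is an assumption, since the exact match criterion isn't shown in the thread):

```scala
object CartesianMatch {
  def main(args: Array[String]): Unit = {
    val a = Seq("hi", "bye", "ch")                    // stand-in for RDD A
    val b = Seq("padma", "hihi", "chch", "priya")     // stand-in for RDD B

    // Equivalent of A.cartesian(B).filter { case (x, y) => y.contains(x) }:
    // every element of A is compared against every element of B,
    // so the work grows as |A| * |B|.
    val matches = for {
      x <- a
      y <- b
      if y.contains(x)                                // assumed match criterion
    } yield (x, y)

    println(matches)                                  // List((hi,hihi), (ch,chch))
  }
}
```

On real RDDs the same shape, `A.cartesian(B)` followed by a filter, materialises every pair across the cluster, which is why it gets slow as the inputs grow.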
Is there any better way I could do this other than performing a cartesian? The application has taken 30 mins so far, and on top of that I see executor-lost issues.

Thanks,
Padma Ch

On Wed, May 25, 2016 at 4:22 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> What is the use case of this? A Cartesian product is by definition slow
> in any system. Why do you need this? How long does your application take
> now?
>
> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> I tried
> dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path").
> Even this is taking too much time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com>
> wrote:
>
>> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
>> parquet, orc, ...?
>>
>> // maropu
>>
>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com>
>> wrote:
>>
>>> Hi, yes, I have joined using a DataFrame join. Now, to save this into
>>> HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd)
>>> and trying to save it with saveAsTextFile. However, this is also taking
>>> too much time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems you'd be better off using DataFrame#join instead of
>>>> RDD.cartesian, because cartesian always needs shuffle operations,
>>>> which have a lot of overhead such as reflection, serialization, ...
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>>> broadcast strategy.
>>>> This is a little more efficient than RDD.cartesian.
>>>>
>>>> // maropu
>>>>
>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>>
>>>>> Example:
>>>>>
>>>>> SELECT * FROM FinancialCodes, FinancialData
>>>>>
>>>>> The result of this query matches every row in the FinancialCodes
>>>>> table with every row in the FinancialData table. Each row consists
>>>>> of all columns from the FinancialCodes table followed by all columns
>>>>> from the FinancialData table.
>>>>>
>>>>> Not very useful.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn:
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have two RDDs A and B, where A is of size 30 MB and B is of
>>>>>> size 7 MB. A.cartesian(B) is taking too much time. Is there any
>>>>>> bottleneck in the cartesian operation?
>>>>>>
>>>>>> I am using Spark version 1.6.0.
>>>>>>
>>>>>> Regards,
>>>>>> Padma Ch
>>>>>
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
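The broadcast strategy suggested in the thread can be illustrated without a cluster: collect the small side into local memory and scan the large side once, instead of materialising all |A| × |B| pairs and shuffling them. A minimal local sketch (plain Scala collections stand in for the RDDs, and the containment test is again an assumed match criterion):

```scala
object BroadcastStyleMatch {
  def main(args: Array[String]): Unit = {
    val small = Seq("hi", "bye", "ch")                 // the 7 MB side, small enough to broadcast
    val large = Seq("padma", "hihi", "chch", "priya")  // the 30 MB side

    // In Spark this would be sc.broadcast(smallRdd.collect()) and a single
    // map over the large RDD; here a local val plays the broadcast role.
    val broadcastSmall = small

    // One pass over the large side; each record is checked against the
    // broadcast copy, so there is no shuffle and no cartesian materialisation.
    val matches = large.flatMap { y =>
      broadcastSmall.filter(x => y.contains(x)).map(x => (x, y))
    }

    println(matches)                                   // List((hi,hihi), (ch,chch))
  }
}
```

This produces the same pairs as the cartesian-plus-filter version, but the cost is one scan of the large side times the (small) broadcast set, which is essentially what Spark's broadcast join does under the hood when one side fits in memory.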