What is the use case for this? A Cartesian product is by definition slow in any system. Why do you need it? How long does your application take now?
> On 25 May 2016, at 12:42, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
> I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time.
>
> Thanks,
> Padma Ch
>
>> On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as parquet, orc, ...?
>>
>> // maropu
>>
>>> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>> Hi, yes, I have joined using a DataFrame join. Now, to save the result into HDFS, I am converting the joined DataFrame to an RDD (dataframe.rdd) and trying to save it with saveAsTextFile. However, this is also taking too much time.
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> It seems you'd be better off using DataFrame#join instead of RDD.cartesian, because the latter always needs shuffle operations, which carry a lot of overhead (reflection, serialization, ...).
>>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a broadcast strategy. This is a little more efficient than RDD.cartesian.
>>>>
>>>> // maropu
>>>>
>>>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>>
>>>>> Example:
>>>>>
>>>>> SELECT * FROM FinancialCodes, FinancialData
>>>>>
>>>>> The result of this query matches every row in the FinancialCodes table with every row in the FinancialData table. Each row consists of all columns from the FinancialCodes table followed by all columns from the FinancialData table.
>>>>>
>>>>> Not very useful.
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I have two RDDs, A and B, where A is 30 MB and B is 7 MB. A.cartesian(B) is taking too much time. Is there any bottleneck in the cartesian operation?
>>>>>>
>>>>>> I am using Spark version 1.6.0.
>>>>>>
>>>>>> Regards,
>>>>>> Padma Ch
>>>>
>>>> --
>>>> ---
>>>> Takeshi Yamamuro
>>
>> --
>> ---
>> Takeshi Yamamuro
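
For reference, Takeshi's advice to use DataFrame#join with a broadcast of the small side can be sketched roughly as below in Spark 1.6 Scala. The input paths, the join column "key", and the local-mode setup are assumptions for illustration, not details from the thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

// Minimal sketch (Spark 1.6 API); paths and column names are hypothetical.
val sc = new SparkContext(new SparkConf().setAppName("join-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val dfA = sqlContext.read.parquet("/hdfs_path/a")   // the ~30 MB side
val dfB = sqlContext.read.parquet("/hdfs_path/b")   // the ~7 MB side

// Explicitly broadcast the small side. With spark.sql.autoBroadcastJoinThreshold
// at its 10 MB default, a 7 MB table would typically be broadcast automatically,
// but the hint makes the intent explicit and avoids a shuffle of the large side.
val joined = dfA.join(broadcast(dfB), dfA("key") === dfB("key"))
```

Note this assumes an equi-join condition exists; a true Cartesian product (no join key) has no such shortcut and is inherently |A| x |B| rows.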
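
The write path discussed above can likewise stay in the DataFrame API rather than dropping to dataframe.rdd. A hedged sketch (Spark 1.6 API; the paths and the input data are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("write-example").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.parquet("/hdfs_path/joined_input")  // hypothetical input

// Avoid df.rdd.saveAsTextFile(...): it converts every Row to a string and
// bypasses the columnar writers. Write parquet (or ORC) from the DataFrame:
df.write.parquet("/hdfs_path/out_parquet")

// If CSV is really required, the spark-csv package used in the thread also
// writes straight from the DataFrame:
df.write.format("com.databricks.spark.csv").option("header", "true").save("/hdfs_path/out_csv")
```

If the writes are still slow after this, the cost is usually in the join producing a huge result, not in the writer itself.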