I tried dataframe.write.format("com.databricks.spark.csv").save("/hdfs_path"). Even this is taking too much time.
Thanks,
Padma Ch

On Wed, May 25, 2016 at 3:47 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Why did you use RDD#saveAsTextFile instead of DataFrame#save, writing as
> parquet, orc, ...?
>
> // maropu
>
> On Wed, May 25, 2016 at 7:10 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Hi,
>>
>> Yes, I have joined using DataFrame join. Now, to save this into HDFS, I am
>> converting the joined DataFrame to an RDD (dataframe.rdd) and trying to
>> save it with saveAsTextFile. However, this is also taking too much time.
>>
>> Thanks,
>> Padma Ch
>>
>> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It seems you'd be better off using DataFrame#join instead of
>>> RDD.cartesian, because the latter always needs shuffle operations,
>>> which carry a lot of overhead such as reflection, serialization, ...
>>> In your case, since the smaller table is 7 MB, DataFrame#join uses a
>>> broadcast strategy. This is more efficient than RDD.cartesian.
>>>
>>> // maropu
>>>
>>> On Wed, May 25, 2016 at 4:20 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> It is basically a Cartesian join, as in an RDBMS.
>>>>
>>>> Example:
>>>>
>>>> SELECT * FROM FinancialCodes, FinancialData
>>>>
>>>> The result of this query matches every row in the FinancialCodes table
>>>> with every row in the FinancialData table. Each row consists of all
>>>> columns from the FinancialCodes table followed by all columns from the
>>>> FinancialData table.
>>>>
>>>> Not very useful.
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>> On 25 May 2016 at 08:05, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have two RDDs, A and B, where A is of size 30 MB and B is of size
>>>>> 7 MB. A.cartesian(B) is taking too much time. Is there any bottleneck
>>>>> in the cartesian operation?
>>>>>
>>>>> I am using Spark version 1.6.0.
>>>>>
>>>>> Regards,
>>>>> Padma Ch
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>
>
> --
> ---
> Takeshi Yamamuro
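[Editor's note] A back-of-the-envelope estimate shows why A.cartesian(B) is slow even though both inputs are small: the output row count is the product of the two input row counts, so the output grows quadratically. The average row size below is an assumption purely for illustration, not something stated in the thread:

```python
# Rough size estimate for the cartesian product of a 30 MB RDD and a
# 7 MB RDD. The 100-byte average row size is an assumption; the
# quadratic blow-up is the point, not the exact numbers.
MB = 1024 * 1024
row_bytes = 100                      # assumed average row size

rows_a = (30 * MB) // row_bytes      # ~315k rows in the 30 MB RDD
rows_b = (7 * MB) // row_bytes       # ~73k rows in the 7 MB RDD

output_rows = rows_a * rows_b
# Each output pair carries one row from A and one from B.
output_bytes = output_rows * 2 * row_bytes

print(f"{output_rows:,} output rows, ~{output_bytes / 1024**4:.1f} TiB")
# → 23,089,584,800 output rows, ~4.2 TiB
```

Under these assumptions the cartesian product materializes tens of billions of pairs and terabytes of output, which is why it dominates the runtime regardless of how the result is later saved.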
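[Editor's note] The broadcast strategy Takeshi mentions is essentially a hash join in which the small side is shipped whole to every worker and indexed by join key, so the large side is streamed past it with no shuffle. A minimal, Spark-free sketch in plain Python; the table contents and key names here are invented for illustration only:

```python
# Sketch of a broadcast hash join: build a hash map from the small
# table once, then probe it with one pass over the large table.
# This mirrors what a broadcast join does on each executor.

small_table = [           # stands in for the ~7 MB side (illustrative data)
    ("AAPL", "Technology"),
    ("XOM", "Energy"),
]
large_table = [           # stands in for the ~30 MB side (illustrative data)
    ("AAPL", 150.0),
    ("AAPL", 151.2),
    ("XOM", 61.5),
    ("MSFT", 300.1),      # no match in the small table; dropped by inner join
]

# 1. "Broadcast": index the small table by join key.
lookup = {key: value for key, value in small_table}

# 2. Probe: stream the large table against the in-memory map.
joined = [
    (key, price, lookup[key])
    for key, price in large_table
    if key in lookup
]

print(joined)
# → [('AAPL', 150.0, 'Technology'), ('AAPL', 151.2, 'Technology'), ('XOM', 61.5, 'Energy')]
```

The cost is one pass over each input plus hash lookups, rather than the |A| × |B| pair enumeration a cartesian product performs, which is why the broadcast join wins so decisively when one side fits in memory.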