If you are reading all these datasets from files in persistent storage,
functions like sc.textFile can take folders/patterns as input and read all of
the matching files into the same RDD. You can then convert that RDD to a DataFrame.
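
For example, a minimal sketch in Spark 2.0 Scala, assuming the datasets are comma-separated text files under a common directory (the path, delimiter, and numeric interest column are all assumptions on my part):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("combine-files").getOrCreate()
  import spark.implicits._

  // textFile accepts directories, glob patterns, and comma-separated path
  // lists, so every matching file lands in a single RDD
  val lines = spark.sparkContext.textFile("/data/interests/*.csv")

  // split each line into the three columns from your schema
  val df = lines.map(_.split(","))
    .map(a => (a(0), a(1), a(2).toDouble))
    .toDF("client_id", "product_id", "interest")

In Spark 2.0 you can also pass a glob or multiple paths directly to spark.read.csv, which skips the RDD step entirely.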

When you say it is time consuming with union, how are you measuring that? Did
you try having all of them in one DF in comparison to having them broken down?
Are you seeing a non-linear slowdown in operations after the union with a
linear increase in data size?
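
One thing to double-check before comparing numbers: union is lazy, so timing the union call by itself measures almost nothing. A rough sketch (the timed helper below is just illustrative) that forces an action so the full cost shows up:

  // union only builds a plan; an action such as count() triggers the real work
  def timed[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    val result = body
    println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
    result
  }

  // df1..df4 as in your message below
  val combined = df1.union(df2).union(df3).union(df4)
  timed("union + count")(combined.count())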

From: Devi P.V <devip2...@gmail.com>
Sent: Tuesday, November 15, 2016 11:06 PM
To: user@spark <user@spark.apache.org>
Subject: what is the optimized way to combine multiple dataframes into one dataframe?

Hi all,

I have 4 dataframes, each with three columns:

client_id,product_id,interest

I want to combine these 4 dataframes into one dataframe. I used union like
the following:

df1.union(df2).union(df3).union(df4)

But it is time consuming for big data. What is the optimized way of doing this
using Spark 2.0 & Scala?


Thanks
