If you are reading these datasets from files in persistent storage, functions like sc.textFile can take a folder or a glob pattern as input and read all matching files into a single RDD. You can then convert that to a DataFrame.
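A minimal sketch of that approach, assuming Spark 2.x with a local SparkSession; the paths here are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object ReadManyFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-many-files")
      .master("local[*]")
      .getOrCreate()

    // textFile accepts a directory, a glob, or a comma-separated list of
    // paths, so every matching file lands in one RDD.
    val rdd = spark.sparkContext.textFile("/data/clients/part-*.csv")

    // The DataFrame reader accepts the same path patterns and parses the
    // files directly into one DataFrame, skipping the RDD step entirely.
    val df = spark.read
      .option("header", "true")
      .csv("/data/clients/")

    df.printSchema()
    spark.stop()
  }
}
```

If all the source files share a schema, reading them in one call like this is usually simpler than reading them separately and unioning afterwards.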
When you say union is time consuming, how are you measuring that? Did you try having all of the data in one DataFrame, in comparison to having it broken down into several? Are you seeing a non-linear slowdown in operations after the union with a linear increase in data size?

From: Devi P.V <devip2...@gmail.com>
Sent: Tuesday, November 15, 2016 11:06 PM
To: user@spark.apache.org
Subject: what is the optimized way to combine multiple dataframes into one dataframe?

Hi all,

I have 4 DataFrames with three columns: client_id, product_id, interest. I want to combine these 4 DataFrames into one DataFrame. I used union like the following:

df1.union(df2).union(df3).union(df4)

But it is time consuming for big data. What is the optimized way of doing this using Spark 2.0 & Scala?

Thanks
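For reference, a chained union like the one in the question can be written more idiomatically with reduce. This is a sketch assuming Spark 2.x; the data is made up so the example is self-contained:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Four small stand-in DataFrames with the columns from the question.
    val dfs: Seq[DataFrame] = Seq(
      Seq((1, 10, 0.5), (2, 11, 0.3)),
      Seq((3, 12, 0.9)),
      Seq((4, 13, 0.1)),
      Seq((5, 14, 0.7))
    ).map(_.toDF("client_id", "product_id", "interest"))

    // Fold the unions instead of writing df1.union(df2).union(df3)... by hand.
    // union itself is a narrow transformation that concatenates partitions
    // without a shuffle, so the real cost usually sits in whatever action
    // runs afterwards, not in the union itself.
    val combined = dfs.reduce(_ union _)

    println(combined.count())  // 5 rows across the four inputs
    spark.stop()
  }
}
```

Note that DataFrame.union resolves columns by position, not by name, so all four inputs must have their columns in the same order.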