Hi there!

Let's imagine I have a large number of very small DataFrames with the same schema (a list of DataFrames: allDFs),
and I want to create one large DataFrame from them.

I have been trying this:
-> allDFs.reduce((a, b) => a.union(b))

And then this one:
-> allDFs.reduce((a, b) => a.union(b).repartition(200))
to keep the resulting DataFrame from ending up with a very large number of partitions.
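
For context, here is a self-contained version of what I'm doing; the SparkSession setup and the toy DataFrames are just stand-ins to make the sketch runnable:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder()
      .appName("union-many-small-dfs")
      .master("local[*]") // stand-in: local run, just for the sketch
      .getOrCreate()
    import spark.implicits._

    // Stand-in for allDFs: many tiny DataFrames sharing one schema.
    val allDFs: Seq[DataFrame] =
      (1 to 100).map(i => Seq((i, s"value_$i")).toDF("id", "value"))

    // Variant 1: plain union-reduce.
    val unioned = allDFs.reduce((a, b) => a.union(b))

    // Variant 2: repartition inside the reduce to cap the partition count.
    val bounded = allDFs.reduce((a, b) => a.union(b).repartition(200))

    println(unioned.rdd.getNumPartitions) // grows with the number of inputs
    println(bounded.rdd.getNumPartitions) // stays at 200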


Two questions:
1) Will the reduce operation be done in parallel in the previous code? Or should I maybe replace my reduce with allDFs.par.reduce (sketched below)?
2) Is there a better way to concatenate them?
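
For reference, this is what I mean by the parallel variant; a minimal sketch assuming Scala 2.12, where .par is available on standard collections (on 2.13 it would need the separate scala-parallel-collections module):

    // Same union-reduce, but folding over a parallel collection on the driver.
    val unionedPar = allDFs.par.reduce((a, b) => a.union(b))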


Thanks!
Julio
