Hi there!
Let's imagine I have a large number of very small DataFrames with the
same schema (a list of DataFrames: allDFs),
and I want to create one large Dataset from them.
I have been trying this:
-> allDFs.reduce((a, b) => a.union(b))
And then this:
-> allDFs.reduce((a, b) => a.union(b).repartition(200))
to avoid ending up with a DataFrame that has a very large number of partitions.
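For context, here is a minimal, self-contained sketch of the setup described above, run against a local SparkSession; the input DataFrames and the `unionAll` helper name are illustrative assumptions, not part of any Spark API. Note that `union` is a lazy transformation, so the `reduce` only builds the logical plan on the driver:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionManyExample {
  // Hypothetical helper: chain-union a non-empty list of same-schema DataFrames.
  // union is lazy, so this fold only constructs the logical plan.
  def unionAll(dfs: Seq[DataFrame]): DataFrame =
    dfs.reduce((a, b) => a.union(b))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-many")
      .getOrCreate()
    import spark.implicits._

    // 100 tiny single-column DataFrames with the same schema
    val allDFs: Seq[DataFrame] = (0 until 100).map(i => Seq(i).toDF("id"))

    // Repartition once at the end of the fold rather than at every step,
    // avoiding a shuffle per union
    val big = unionAll(allDFs).repartition(200)

    println(big.count())                  // 100 rows total
    println(big.rdd.getNumPartitions)     // 200 partitions
    spark.stop()
  }
}
```

Repartitioning once after the full union keeps the same end result as repartitioning inside the reduce, but triggers a single shuffle instead of one per step.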
Two questions:
1) Will the reduce operation in the code above be done in parallel,
or should I replace my reduce with allDFs.par.reduce?
2) Is there a better way to concatenate them?
Thanks!
Julio