By repartition, I think you mean coalesce(), where you would get one Parquet file
per partition?
And since this would be a new immutable copy, you would want to write the new
RDD to a different HDFS directory?
-Mike
> On Jun 21, 2016, at 8:06 AM, Eugene Morozov
Apurva,
I'd say you have to apply repartition just once, to the RDD that is the union of
all your files.
And it has to be done right before you do anything else.
If some of the data in your files is not needed, then the sooner you project,
the better.
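
To illustrate why the union blows up the partition count and why a single
repartition on the final union is enough, here is a minimal sketch in plain
Python (no Spark required). It only models partition counts. It assumes the
inputs share no partitioner, so union keeps every partition of both sides; the
file names, partition counts, and target of 4 are made up for illustration.

```python
from functools import reduce

# Hypothetical inputs: file name -> number of partitions it gets when loaded
# (made-up numbers, for illustration only).
input_partitions = {"a.txt": 8, "b.txt": 24, "c.txt": 16}

def union_partitions(left, right):
    # Model of union for inputs without a shared partitioner:
    # the result keeps every partition of both sides.
    return left + right

def repartition(current, target):
    # Model of repartition(target): the result has exactly `target`
    # partitions, no matter how many there were before.
    return target

# Union everything first, then repartition once at the end.
total = reduce(union_partitions, input_partitions.values())
after = repartition(total, 4)
print(total, after)
```

In Spark itself this would correspond to something like
sc.union(rdds).repartition(n), or coalesce(n) to avoid a full shuffle, applied
once to the combined RDD rather than to each input.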
Hope this helps.
--
Be well!
Jean Morozov
On Tue,
Hello,
I am trying to combine several small text files (each roughly a few hundred MB
to 2-3 GB) into one big Parquet file.
I am loading each one of them and trying to take a union; however, this
leads to an enormous number of partitions, as union keeps on adding the
partitions of the input