By repartition I think you mean coalesce(), where you would get one Parquet file per partition?
And this would be a new immutable copy, so you would want to write this new RDD to a different HDFS directory?

-Mike

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote:
>
> Apurva,
>
> I'd say you have to apply repartition just once, to the RDD that is the
> union of all your files, and it has to be done right before you do
> anything else.
>
> If something is not needed from your files, then the sooner you project,
> the better.
>
> Hope this helps.
>
> --
> Be well!
> Jean Morozov
>
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to combine several small text files (each file is approx.
>> hundreds of MBs to 2-3 GBs) into one big Parquet file.
>>
>> I am loading each one of them and taking a union, however this leads to
>> an enormous number of partitions, as union keeps adding the partitions
>> of the input RDDs together.
>>
>> I also tried loading all the files via a wildcard, but that behaves
>> almost the same as union, i.e. it generates a lot of partitions.
>>
>> One approach I thought of was to repartition the RDD generated after
>> each union and then continue the process, but I don't know how
>> efficient that is.
>>
>> Has anyone come across this kind of thing before?
>>
>> - Apurva