By repartition I think you mean coalesce(), where you would get one Parquet file 
per partition? 

And this would be a new immutable copy, so you would want to write this new 
RDD to a different HDFS directory? 

-Mike

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> 
> wrote:
> 
> Apurva, 
> 
> I'd say you have to apply repartition just once, to the RDD that is the union 
> of all your files.
> And it has to be done right before you do anything else.
> 
> If some of the data in your files is not needed, then the sooner you project 
> it away, the better.
> 
> Hope this helps.
> 
> --
> Be well!
> Jean Morozov
> 
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan <apurva3...@gmail.com 
> <mailto:apurva3...@gmail.com>> wrote:
> Hello,
> 
> I am trying to combine several small text files (each file is approx. hundreds 
> of MB to 2-3 GB) into one big Parquet file. 
> 
> I am loading each one of them and taking a union; however, this leads to an 
> enormous number of partitions, since union keeps adding together the partitions 
> of the input RDDs.
> 
> I also tried loading all the files via a wildcard, but that behaves almost the 
> same as union, i.e. it generates a lot of partitions.
> 
> One approach I thought of was to repartition the RDD generated after each 
> union and then continue the process, but I don't know how efficient that 
> is.
> 
> Has anyone come across this kind of thing before?
> 
> - Apurva 
> 
> 
> 
