Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition, I think you mean coalesce(), so that you would get one Parquet file per partition? And since this would be a new immutable copy, you would want to write the new RDD to a different HDFS directory? -Mike > On Jun 21, 2016, at 8:06 AM, Eugene Morozov

Re: Union of multiple RDDs

2016-06-21 Thread Eugene Morozov
Apurva, I'd say you have to apply repartition just once, to the RDD that is the union of all your files, and it has to be done right before you do anything else. If something in your files is not needed, then the sooner you project it away, the better. Hope this helps. -- Be well! Jean Morozov On Tue,
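
To illustrate the "union everything, then repartition once" advice, here is a minimal sketch in plain Python. It is a toy model and not the Spark API: an "RDD" is just a list of partitions, the `repartition` helper is hypothetical, and the round-robin redistribution only approximates what a real shuffle does. The input sizes (6 files, 8 partitions each, 5 records per partition) are made up for the example.

```python
# Toy model, NOT the Spark API: an "RDD" is a list of partitions,
# and each partition is a list of records.

def repartition(partitions, n):
    """Hypothetical helper: spread all records round-robin over n new partitions."""
    out = [[] for _ in range(n)]
    records = (rec for part in partitions for rec in part)
    for i, rec in enumerate(records):
        out[i % n].append(rec)
    return out

# Made-up inputs: 6 files, each with 8 partitions of 5 records.
inputs = [
    [[f * 100 + p * 10 + r for r in range(5)] for p in range(8)]
    for f in range(6)
]

# Union first: every input partition is kept, so the count adds up.
unioned = [part for rdd in inputs for part in rdd]

# Repartition exactly once, on the union, before any further work.
result = repartition(unioned, 4)

print(len(unioned), len(result))
```

The point of the sketch is only the ordering: the expensive redistribution happens once over the combined data, rather than once per input file.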

Union of multiple RDDs

2016-06-21 Thread Apurva Nandan
Hello, I am trying to combine several small text files (each roughly hundreds of MB to 2-3 GB) into one big Parquet file. I am loading each one of them and trying to take a union; however, this leads to an enormous number of partitions, as union keeps on adding the partitions of the input
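
The partition growth described here can be pictured with a small plain-Python sketch. It is a toy model under stated assumptions and not the Spark API: an "RDD" is modelled as a list of partitions, the `union` helper is hypothetical, and the file and partition counts are invented for illustration.

```python
# Toy model, NOT the Spark API: an "RDD" is a list of partitions,
# and each partition is a list of records.

def union(*rdds):
    """Hypothetical helper: keep every partition of every input, unchanged."""
    combined = []
    for rdd in rdds:
        combined.extend(rdd)
    return combined

# Made-up inputs: 3 files, each already split into 4 partitions.
files = [[[f"file{i}-part{p}"] for p in range(4)] for i in range(3)]

merged = union(*files)

# The union has as many partitions as all inputs combined.
print(len(merged))
```

With many input files the combined count grows linearly in the same way, which is why the partition count needs to be reduced before writing.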