Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition, I think you mean coalesce(), which would give you one parquet
file per partition?

And since the result would be a new immutable copy, you would want to write
this new RDD to a different HDFS directory?

-Mike

> On Jun 21, 2016, at 8:06 AM, Eugene Morozov wrote:
> 
> Apurva, 
> 
> I'd say you have to apply repartition just once, to the RDD that is the
> union of all your files, and it has to be done right before you do anything
> else.
> 
> If something in your files is not needed, then the sooner you project, the
> better.
> 
> Hope this helps.
> 
> --
> Be well!
> Jean Morozov
> 
> On Tue, Jun 21, 2016 at 4:48 PM, Apurva Nandan wrote:
> Hello,
> 
> I am trying to combine several text files (each file is approx. hundreds
> of MB to 2-3 GB) into one big parquet file.
> 
> I am loading each one of them and taking a union; however, this leads to an
> enormous number of partitions, as union keeps adding the partitions of the
> input RDDs together.
> 
> I also tried loading all the files via a wildcard, but that behaves almost
> the same as union, i.e. it generates a lot of partitions.
> 
> One approach I thought of was to repartition the RDD generated after each
> union and then continue the process, but I don't know how efficient that
> is.
> 
> Has anyone come across this kind of thing before?
> 
> - Apurva 


