Split content into multiple Parquet files

2015-09-08 Thread Adrien Mogenet
Hi there,

We've spent several hours trying to split our input data into several Parquet
files (or several folders, i.e.
/datasink/output-parquets//foobar.parquet), based on a low-cardinality
key. This works very well when using saveAsHadoopFile, but we can't
achieve the same thing with Parquet files.
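For reference, this kind of key-based splitting with saveAsHadoopFile is
typically done with a custom MultipleTextOutputFormat, along the lines of the
sketch below; the class, RDD, and path names are illustrative, not the exact
code used here.

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Route each (key, value) record into a sub-directory named after its key.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"$key/$name"
  // Write only the payload; drop the key from the output records.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// keyedRdd is assumed to be an RDD[(String, String)] keyed by the low-cardinality column.
// Hash-partitioning groups each key's records into a single task, so each key
// ends up with one file under its own directory.
keyedRdd
  .partitionBy(new HashPartitioner(8))
  .saveAsHadoopFile(
    "/datasink/output-text",
    classOf[String],
    classOf[String],
    classOf[KeyBasedOutputFormat])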

The only working solution so far is to persist the RDD and then loop over
it N times to write N files. That does not look acceptable...
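Concretely, that N-pass workaround looks roughly like the sketch below; the
names are hypothetical and the per-key write uses the Spark 1.3.x
saveAsParquetFile API.

// One full filter-and-write pass over the persisted RDD per distinct key.
val keyed = inputRdd.persist()               // hypothetical RDD[(String, org.apache.spark.sql.Row)]
val keys  = keyed.keys.distinct().collect()  // small, since the key has low cardinality
keys.foreach { k =>
  val df = sqlContext.createDataFrame(keyed.filter(_._1 == k).values, schema)
  df.saveAsParquetFile(s"/datasink/output-parquets/$k")  // Spark 1.3.x writer API
}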

Do you guys have any suggestions for such an operation?

-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris


Re: Split content into multiple Parquet files

2015-09-08 Thread Cheng Lian

In Spark 1.4 and 1.5, you can do something like this:

df.write.partitionBy("key").parquet("/datasink/output-parquets")
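A slightly fuller sketch, with placeholder column and path names: partitionBy
writes one sub-directory per distinct value (e.g. .../key=foo/part-*.parquet),
and partition discovery turns "key" back into a column on read.

// Read some existing data and rewrite it partitioned by the low-cardinality column.
val df = sqlContext.read.parquet("/datasink/input-parquets")   // hypothetical input path
df.write.partitionBy("key").parquet("/datasink/output-parquets")

// On read, a filter on the partition column only touches the matching sub-directories.
val onlyFoo = sqlContext.read.parquet("/datasink/output-parquets").filter("key = 'foo'")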

BTW, I'm curious: how did you do it without partitionBy when using 
saveAsHadoopFile?


Cheng

On 9/8/15 2:34 PM, Adrien Mogenet wrote:

Hi there,

We've spent several hours trying to split our input data into several Parquet 
files (or several folders, i.e. 
/datasink/output-parquets//foobar.parquet), based on a 
low-cardinality key. This works very well when using 
saveAsHadoopFile, but we can't achieve the same thing with Parquet files.


The only working solution so far is to persist the RDD and then loop 
over it N times to write N files. That does not look acceptable...


Do you guys have any suggestions for such an operation?

--

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com 
(+33)6.59.16.64.22
http://www.contentsquare.com 
50, avenue Montaigne - 75008 Paris




Re: Split content into multiple Parquet files

2015-09-08 Thread Adrien Mogenet
My bad, I realized my question was unclear.

I did a partitionBy when using saveAsHadoopFile. My question was about
doing the same thing for Parquet files. We were using Spark 1.3.x, but now
that we've upgraded to 1.4.1 I'd totally forgotten that this becomes possible :-)

Thanks for the answer, then!

On 8 September 2015 at 12:58, Cheng Lian wrote:

> In Spark 1.4 and 1.5, you can do something like this:
>
> df.write.partitionBy("key").parquet("/datasink/output-parquets")
>
> BTW, I'm curious: how did you do it without partitionBy when using
> saveAsHadoopFile?
>
> Cheng
>


-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris