Re: [SQL] Set Parquet block size?

Michael Allman Thu, 09 Oct 2014 19:30:35 -0700

Hi Pierre,

I'm setting the Parquet (and HDFS) block size as follows:


    val ONE_GB = 1024 * 1024 * 1024
    sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
    sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)

Here, sc is a reference to the SparkContext. I've tested this and it works for 
me. Hopefully this helps resolve your memory issue.
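
If you want to try several sizes, the same two settings can be wrapped in a
small helper. This is only a sketch: ParquetBlockSize and setBlockSizes are
hypothetical names of mine, not part of Spark or Parquet, and the 64 MB value
is just an illustration. As far as I understand, the Parquet writer buffers
roughly one block (row group) per open file before flushing, so with thousands
of columns a smaller value may be worth testing as well.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper (not a Spark or Parquet API): applies one size to both
// the HDFS block size and the Parquet block (row group) size.
object ParquetBlockSize {
  def setBlockSizes(conf: Configuration, bytes: Int): Unit = {
    // An Int tops out just under 2 GB; an expression such as
    // 2 * 1024 * 1024 * 1024 overflows to a negative number.
    require(bytes > 0, "block size must be a positive number of bytes")
    conf.setInt("dfs.blocksize", bytes)
    conf.setInt("parquet.block.size", bytes)
  }

  def main(args: Array[String]): Unit = {
    // false: do not load the default Hadoop resource files for this demo
    val conf = new Configuration(false)
    val SixtyFourMb = 64 * 1024 * 1024 // illustrative value only
    setBlockSizes(conf, SixtyFourMb)
    println(conf.get("parquet.block.size"))
  }
}
```

In a Spark job you would pass sc.hadoopConfiguration instead of a fresh
Configuration, for example
ParquetBlockSize.setBlockSizes(sc.hadoopConfiguration, 64 * 1024 * 1024),
before saving the SchemaRDD.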

Good luck!

Michael

On Oct 9, 2014, at 8:43 AM, Pierre B <[email protected]> 
wrote:

> Hi there!
> 
> Is there a way to modify default parquet block size?
> 
> I didn't see any reference to ParquetOutputFormat.setBlockSize in Spark code
> so I was wondering if there was a way to provide this option?
> 
> I'm asking because we are facing Out of Memory issues when writing parquet
> files.
> The RDD we are saving to Parquet has a fairly high number of columns (in
> the thousands, around 3k for the moment).
> 
> The only way we can get rid of this for the moment is by doing a .coalesce
> on the SchemaRDD before saving to parquet, but as we get more columns, even
> this approach is not working.
> 
> Any help is appreciated!
> 
> Thanks
> 
> Pierre 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SQL-Set-Parquet-block-size-tp16039.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
