Hi Pierre,
I'm setting the Parquet (and HDFS) block size as follows:
val ONE_GB = 1024 * 1024 * 1024
sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)
Here, sc is a reference to the SparkContext. I've tested this and it works for
me. Hopefully this helps resolve your memory issue.
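
If you want to experiment with other sizes, here is a minimal sketch that wraps
the two settings in a helper. The helper name setBlockSizes and the 64 MB value
are my own illustrative assumptions, not anything from Spark or Parquet. As far
as I understand, the Parquet writer buffers roughly one block (row group) per
open file before flushing, so a smaller value may be worth trying for the
memory issue. Note also that 1024 * 1024 * 1024 still fits in an Int, but
doubling it would overflow.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical helper (not part of Spark or Parquet): sets both block sizes
// on a Hadoop configuration using the same keys as above.
def setBlockSizes(conf: Configuration, bytes: Int): Unit = {
  conf.setInt("dfs.blocksize", bytes)
  conf.setInt("parquet.block.size", bytes)
}

// Illustrative value only; pick whatever suits your data and memory budget.
val SIXTY_FOUR_MB = 64 * 1024 * 1024

// In a Spark job, pass sc.hadoopConfiguration instead of a fresh Configuration.
val conf = new Configuration()
setBlockSizes(conf, SIXTY_FOUR_MB)

// Read the value back to confirm it was set.
println(conf.getInt("parquet.block.size", -1))
```

The settings need to be in place before the save call so that the writer picks
them up.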
Good luck!
Michael
On Oct 9, 2014, at 8:43 AM, Pierre B <[email protected]>
wrote:
> Hi there!
>
> Is there a way to modify default parquet block size?
>
> I didn't see any reference to ParquetOutputFormat.setBlockSize in Spark code
> so I was wondering if there was a way to provide this option?
>
> I'm asking because we are facing Out of Memory issues when writing parquet
> files.
> The RDD we are saving to Parquet has a fairly high number of columns (in
> the thousands, around 3k for the moment).
>
> The only way we can get rid of this for the moment is by doing a .coalesce
> on the SchemaRDD before saving to parquet, but as we get more columns, even
> this approach is not working.
>
> Any help is appreciated!
>
> Thanks
>
> Pierre