Hi Pierre,

I'm setting the Parquet (and HDFS) block sizes as follows:
    val ONE_GB = 1024 * 1024 * 1024
    sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
    sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)

Here, sc is a reference to the Spark context. I've tested this and it works for me. Hopefully this helps resolve your memory issue.

Good luck!

Michael

On Oct 9, 2014, at 8:43 AM, Pierre B <pierre.borckm...@realimpactanalytics.com> wrote:

> Hi there!
>
> Is there a way to modify the default Parquet block size?
>
> I didn't see any reference to ParquetOutputFormat.setBlockSize in the Spark
> code, so I was wondering if there was a way to provide this option?
>
> I'm asking because we are facing Out of Memory issues when writing Parquet
> files.
> The RDD we are saving to Parquet has a fairly high number of columns (in
> the thousands, around 3k for the moment).
>
> The only way we can get rid of this for the moment is by doing a .coalesce
> on the SchemaRDD before saving to Parquet, but as we get more columns, even
> this approach is not working.
>
> Any help is appreciated!
>
> Thanks
>
> Pierre
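For completeness, here is a minimal end-to-end sketch of the approach above, assuming the Spark 1.1-era SchemaRDD API from this thread (createSchemaRDD, saveAsParquetFile); the Row3k case class, the sample data, and the output path are hypothetical stand-ins, not anything from the original messages:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical stand-in for the real ~3k-column schema.
    case class Row3k(c0: Int, c1: Int)

    val sc = new SparkContext(new SparkConf().setAppName("parquet-block-size"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    // Raise both the HDFS and Parquet block sizes to 1 GB before writing.
    val ONE_GB = 1024 * 1024 * 1024  // 2^30, still fits in an Int
    sc.hadoopConfiguration.setInt("dfs.blocksize", ONE_GB)
    sc.hadoopConfiguration.setInt("parquet.block.size", ONE_GB)

    // The data and the output path are placeholders for illustration.
    val schemaRdd = sc.parallelize(1 to 100).map(i => Row3k(i, i * 2))
    schemaRdd.saveAsParquetFile("hdfs:///tmp/wide_table.parquet")

Note that parquet.block.size caps the row group size, and the Parquet writer buffers an entire row group in memory per open file before flushing, so this setting, multiplied by the number of files open per task, directly drives write-side memory use.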