This is probably caused by schema merging. Were you using Spark 1.4 or an earlier version? Could you please try the following snippet and see whether it helps:

df.write
  .format("parquet")
  .option("mergeSchema", "false")
  .partitionBy(partitionCols: _*)
  .mode(saveMode)
  .save(targetPath)

In Spark 1.5, we've disabled schema merging by default.
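
If you'd rather control this explicitly instead of relying on the default, you can also set it through the SQL configuration and only opt in to merging on the reads that actually need it. A rough sketch (sqlContext here stands for whichever SQLContext/HiveContext you're using):

  // Keep global Parquet schema merging off (the 1.5 default).
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

  // Opt in per read only when merged schemas are really needed.
  val merged = sqlContext.read
    .option("mergeSchema", "true")
    .parquet(targetPath)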

Cheng

On 12/11/15 5:33 AM, Matt K wrote:
Hi all,

I have a process that's continuously saving data as Parquet with Spark. The bulk of the saving logic simply looks like this:

df.write
  .format("parquet")
  .partitionBy(partitionCols: _*)
  .mode(saveMode)
  .save(targetPath)

After running for a day or so, my process ran out of memory. I took a memory dump and can see that a single thread is holding 32,189 org.apache.parquet.hadoop.Footer objects, which in turn hold ParquetMetadata. This is highly suspicious, since each thread processes under 1GB of data at a time, and there are usually no more than 10 files in a single batch (so no small-file problem). It looks like there may be a memory leak somewhere in the saveAsParquet code path.

I've attached a screenshot from Eclipse Memory Analyzer showing the above. Note the 32,189 references.

A shot in the dark, but is there a way to disable ParquetMetadata file generation?
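
The closest thing I've come across is Parquet's job-summary switch on the Hadoop configuration, though I'm not sure it's the right knob here. Roughly what I have in mind (a sketch, assuming parquet.enable.summary-metadata is still honored by the Parquet version Spark bundles):

  // Skip writing the _metadata / _common_metadata summary files per write job;
  // the data files themselves should be unaffected.
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")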

Thanks,
-Matt

