This is probably caused by schema merging. Were you using Spark 1.4 or an earlier version? Could you please try the following snippet and see whether it helps:

df.write
  .format("parquet")
  .option("mergeSchema", "false")
  .partitionBy(partitionCols: _*)
  .mode(saveMode)
  .save(targetPath)

In Spark 1.5, we've disabled schema merging by default.
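
If you'd rather control this explicitly instead of relying on the default, you can also set it through the SQL configuration and only opt in to merging on the reads that actually need it. A rough sketch (sqlContext here stands for whichever SQLContext/HiveContext you're using):

  // Keep global Parquet schema merging off (the 1.5 default).
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

  // Opt in per read only when merged schemas are really needed.
  val merged = sqlContext.read
    .option("mergeSchema", "true")
    .parquet(targetPath)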

Cheng

On 12/11/15 5:33 AM, Matt K wrote:
Hi all,

I have a process that's continuously saving data as Parquet with Spark. The bulk of the saving logic simply looks like this:

df.write
  .format("parquet")
  .partitionBy(partitionCols: _*)
  .mode(saveMode)
  .save(targetPath)

After running for a day or so, my process ran out of memory. I took a memory dump and can see that a single thread is holding 32,189 org.apache.parquet.hadoop.Footer objects, which in turn hold ParquetMetadata. This is highly suspicious, since each thread processes under 1GB of data at a time, and there are usually no more than 10 files in a single batch (so no small-file problem). It looks like there may be a memory leak somewhere in the saveAsParquet code path.

I've attached a screenshot from Eclipse Memory Analyzer showing the above. Note the 32,189 references.

A shot in the dark, but is there a way to disable ParquetMetadata file generation?
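
The closest thing I've come across is Parquet's job-summary switch on the Hadoop configuration, though I'm not sure it's the right knob here. Roughly what I have in mind (a sketch, assuming parquet.enable.summary-metadata is still honored by the Parquet version Spark bundles):

  // Skip writing the _metadata / _common_metadata summary files per write job;
  // the data files themselves should be unaffected.
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")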

Thanks,
-Matt

