This is probably caused by schema merging. Were you using Spark 1.4 or
an earlier version? Could you please try the following snippet to see
whether it helps:
df.write
.format("parquet")
.option("mergeSchema", "false")
.partitionBy(partitionCols: _*)
.mode(saveMode)
.save(targetPath)
In 1.5, we've disabled schema merging by default.
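If you want to be explicit about it, you can also turn it off globally via the
SQL conf. A minimal sketch, assuming the 1.5 conf name
spark.sql.parquet.mergeSchema and placeholder names like sqlContext and
targetPath (please double-check the conf name against the docs for your
exact version):

// Disable Parquet schema merging globally (assumed 1.5 conf name).
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

// It can also be disabled per read:
val df2 = sqlContext.read
.option("mergeSchema", "false")
.parquet(targetPath)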
Cheng
On 12/11/15 5:33 AM, Matt K wrote:
Hi all,
I have a process that's continuously saving data as Parquet with
Spark. The bulk of the saving logic simply looks like this:
df.write
.format("parquet")
.partitionBy(partitionCols: _*)
.mode(saveMode)
.save(targetPath)
After running for a day or so, my process ran out of memory. I took a
memory dump, and I see that a single thread is holding 32,189
org.apache.parquet.hadoop.Footer objects, which in turn hold
ParquetMetadata. This is highly suspicious, since each thread
processes under 1GB of data at a time, and there are usually no more
than 10 files in a single batch (no small-file problem). So there may
be a memory leak somewhere in the saveAsParquet code path.
I've attached a screenshot from Eclipse Memory Analyzer showing the
above. Note the 32,189 references.
A shot in the dark, but is there a way to disable ParquetMetadata file
generation?
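Would something like the following do it? I'm guessing at the
parquet.enable.summary-metadata Hadoop setting here (as I understand it, it
controls whether parquet-mr writes the _metadata / _common_metadata summary
files), and I haven't verified that Spark's Parquet writer honors it:

// Untested guess: ask parquet-mr not to generate the summary-metadata files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

df.write
.format("parquet")
.partitionBy(partitionCols: _*)
.mode(saveMode)
.save(targetPath)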
Thanks,
-Matt