Thanks Ewan Leith. This seems like a good start, as it seems to match the
symptoms I am seeing :).

But how do I specify "parquet.memory.pool.ratio"?
The Parquet code seems to read this parameter in
ParquetOutputFormat.getRecordWriter()
(ref code: float maxLoad = conf.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. The 'TaskAttemptContext'
passed to it seems to be the hint for where the value comes from, but I am
not able to find a way to provide this configuration.
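
The closest I have found so far is the following (an untested sketch on my
part; I am assuming that Spark's usual spark.hadoop.* property forwarding,
or setting the value on sc.hadoopConfiguration directly, ends up in the
Hadoop Configuration behind the TaskAttemptContext that ParquetOutputFormat
reads from):

  // Option 1: at submit time, relying on Spark copying any "spark.hadoop.*"
  // property into the Hadoop Configuration used on the write path:
  //   spark-submit \
  //     --conf spark.memory.fraction=0.4 \
  //     --conf spark.hadoop.parquet.memory.pool.ratio=0.3 \
  //     ...

  // Option 2: programmatically, before the DataFrame is written out:
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
  sc.hadoopConfiguration.setFloat("parquet.memory.pool.ratio", 0.3f)

Is one of these the intended route, or does it have to go into core-site.xml
as the release notes suggest?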

Please advise,
Muthu

On Wed, Jan 6, 2016 at 1:57 AM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> Hi Muthu, this could be related to a known issue in the release notes
>
> http://spark.apache.org/releases/spark-release-1-6-0.html
>
> Known issues
>
>     SPARK-12546 -  Save DataFrame/table as Parquet with dynamic partitions
> may cause OOM; this can be worked around by decreasing the memory used by
> both Spark and Parquet using spark.memory.fraction (for example, 0.4) and
> parquet.memory.pool.ratio (for example, 0.3, in Hadoop configuration, e.g.
> setting it in core-site.xml).
>
> It's definitely worth setting spark.memory.fraction and
> parquet.memory.pool.ratio and trying again.
>
> Ewan
>
> -----Original Message-----
> From: babloo80 [mailto:bablo...@gmail.com]
> Sent: 06 January 2016 03:44
> To: user@spark.apache.org
> Subject: Out of memory issue
>
> Hello there,
>
> I have a Spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB)
> in different stages of execution and creates a result parquet of 9 GB
> (about 27 million rows containing 165 columns; some columns are map-based,
> holding at most 200-value histograms). The stages involve:
> Step 1: Read the data using the DataFrame API.
> Step 2: Transform the DataFrame to an RDD and perform some transformations
> (some of the columns are turned into histograms, using an empirical
> distribution to cap the number of keys, and some run like a UDAF during
> the reduce-by-key step).
> Step 3: Reduce the result by key so that it can be joined in the next stage.
> Step 4: Perform a left outer join of this result against another produced
> by running steps similar to 1 through 3.
> Step 5: The results are further reduced and written to parquet.
>
> With Apache Spark 1.5.2, I am able to run the job with no issues.
> The current env uses 8 nodes running a total of 320 cores, with 100 GB of
> executor memory per node and the driver program using 32 GB. The approximate
> execution time is about 1.2 hrs. The parquet files are stored in another
> HDFS cluster for the read and the eventual write of the result.
>
> When the same job is executed using Apache Spark 1.6.0, some of the
> executor nodes' JVMs get restarted (with a new executor id). After turning
> on GC stats on the executors, the perm gen appears to max out, showing the
> symptoms of an out-of-memory condition.
>
> Please advise on where to start investigating this issue.
>
> Thanks,
> Muthu
>
>
>
