Thanks Ewan Leith. This seems like a good start, as it seems to match the symptoms I am seeing :).

But how do I specify "parquet.memory.pool.ratio"? The Parquet code seems to take this parameter in ParquetOutputFormat.getRecordWriter() (ref code: float maxLoad = conf.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO, MemoryManager.DEFAULT_MEMORY_POOL_RATIO);). I wonder how this is provided through Apache Spark; 'TaskAttemptContext' seems to be the hint for providing it, but I am not able to find a way to supply this configuration.
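One way to supply a Hadoop-side property like this from Spark is via SparkContext.hadoopConfiguration or the spark.hadoop.* configuration prefix, which Spark copies (minus the prefix) into the Hadoop Configuration it hands to output formats. A minimal sketch, assuming a plain SparkContext and that Parquet's MemoryManager reads the value from the job's Hadoop configuration:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical app setup; the property name comes from the SPARK-12546
    // release note, everything else is illustrative.
    val conf = new SparkConf()
      .setAppName("parquet-pool-ratio-demo")
      // spark.hadoop.* entries are forwarded (without the prefix) into the
      // Hadoop Configuration that ParquetOutputFormat reads.
      .set("spark.hadoop.parquet.memory.pool.ratio", "0.3")
    val sc = new SparkContext(conf)

    // Equivalent at runtime: set it directly on the shared Hadoop config.
    sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")

The same property can also be passed at submit time with --conf spark.hadoop.parquet.memory.pool.ratio=0.3, or set cluster-wide in core-site.xml as the release note suggests.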
Please advise,
Muthu

On Wed, Jan 6, 2016 at 1:57 AM, Ewan Leith <ewan.le...@realitymine.com> wrote:

> Hi Muthu, this could be related to a known issue in the release notes:
>
> http://spark.apache.org/releases/spark-release-1-6-0.html
>
> Known issues
>
> SPARK-12546 - Save DataFrame/table as Parquet with dynamic partitions
> may cause OOM; this can be worked around by decreasing the memory used by
> both Spark and Parquet using spark.memory.fraction (for example, 0.4) and
> parquet.memory.pool.ratio (for example, 0.3, in the Hadoop configuration,
> e.g. setting it in core-site.xml).
>
> It's definitely worth setting spark.memory.fraction and
> parquet.memory.pool.ratio and trying again.
>
> Ewan
>
> -----Original Message-----
> From: babloo80 [mailto:bablo...@gmail.com]
> Sent: 06 January 2016 03:44
> To: user@spark.apache.org
> Subject: Out of memory issue
>
> Hello there,
>
> I have a Spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB)
> in different stages of execution and creates a result parquet file of 9 GB
> (about 27 million rows containing 165 columns; some columns are map-based,
> containing at most 200-value histograms). The stages involve:
>
> Step 1: Read the data using the DataFrame API.
> Step 2: Transform the DataFrame to an RDD and perform some transformations
> (some of the columns are turned into histograms, using an empirical
> distribution to cap the number of keys, and some run like a UDAF during
> the reduce-by-key step).
> Step 3: Reduce the result by key so that it can be used in the next stage
> for a join.
> Step 4: Perform a left outer join of this result with the output of another
> pipeline that runs similar Steps 1 through 3.
> Step 5: Further reduce the results and write them to parquet.
>
> With Apache Spark 1.5.2, I am able to run the job with no issues. The
> current env uses 8 nodes running a total of 320 cores, with 100 GB of
> executor memory per node and the driver program using 32 GB. The
> approximate execution time is about 1.2 hrs. The parquet files are stored
> in another HDFS cluster for reading and the eventual write of the result.
>
> When the same job is executed using Apache Spark 1.6.0, some of the
> executor nodes' JVMs get restarted (with a new executor id). On further
> turning on GC stats on the executors, the perm-gen seems to get maxed out
> and shows the symptoms of out-of-memory.
>
> Please advise on where to start investigating this issue.
>
> Thanks,
> Muthu
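Putting Ewan's suggestion together with the GC diagnostics described above, a submit-time configuration might look like the sketch below. This is illustrative rather than a confirmed fix: 0.4 and 0.3 are the example values from the release note, and -XX:MaxPermSize only applies on Java 7-era JVMs, where the perm-gen symptom Muthu reports can occur.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parquet-write-job") // hypothetical name
      // Spark side of the SPARK-12546 workaround: shrink the unified memory share.
      .set("spark.memory.fraction", "0.4")
      // Parquet side: forwarded into the Hadoop Configuration via the spark.hadoop.* prefix.
      .set("spark.hadoop.parquet.memory.pool.ratio", "0.3")
      // Surface GC activity and raise perm-gen on the executors (Java 7-era flags).
      .set("spark.executor.extraJavaOptions",
           "-verbose:gc -XX:+PrintGCDetails -XX:MaxPermSize=512m")
    val sc = new SparkContext(conf)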