Both the driver (the ApplicationMaster running on Hadoop) and the container (CoarseGrainedExecutorBackend) end up exceeding my 25 GB allocation.
My code is roughly:

  sc.binaryFiles(... 1 million XML files ...)
    .flatMap( ... extract some domain classes; not many, as each XML usually yields zero results ... )
    .reduceByKey( ... reducer ... )
    .saveAsObjectFile(...)

Initially I used groupBy, but that method uses a lot of resources (according to the javadocs). Switching to reduceByKey didn't have any effect. Spark seems to go through two cycles of computation over ~270k items. In the first cycle, around 15 GB of memory is used, and that memory is never reclaimed by GC; this is true for both the driver and the container. In the second cycle, it keeps allocating memory until it runs out and YARN kills it.

Any ideas?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-uses-too-much-memory-maybe-binaryFiles-with-more-than-1-million-files-in-HDFS-groupBy-or-reduc-tp23253.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
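For readers comparing the two operators: a minimal sketch (plain Scala collections, no Spark; the sample data is hypothetical) of why reduceByKey is generally cheaper than groupBy followed by a reduction. A groupBy-style shuffle must buffer every value for a key before reducing, while reduceByKey folds values into a single running accumulator per key, so far less intermediate state is held in memory.

```scala
// Hypothetical sketch, not the poster's actual job: plain Scala
// collections standing in for Spark RDD operations.
object ReduceVsGroup {
  // Toy (key, value) records standing in for the extracted domain classes.
  val records = List(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5))

  // groupBy-style: materialise ALL values per key first, then reduce.
  // The intermediate Map holds every value, analogous to the heavy
  // shuffle buffers of RDD.groupBy / groupByKey.
  val grouped: Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // reduceByKey-style: fold each value into one running total per key,
  // keeping only a single accumulated value per key at any time.
  val reduced: Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  def main(args: Array[String]): Unit = {
    // Both strategies produce the same result; only the amount of
    // intermediate state differs.
    println(grouped == reduced)
  }
}
```

Note this only models the per-key aggregation semantics; it does not reproduce the driver-side memory growth described above, which is more likely related to the metadata for the million-file input split.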