Both the driver (the ApplicationMaster running on YARN) and the executor container
(CoarseGrainedExecutorBackend) end up exceeding my 25 GB allocation.

My code is something like:

sc.binaryFiles(... ~1 million XML files ...)
  .flatMap(... extract some domain classes; not many, as most XML files yield zero results ...)
  .reduceByKey(... reducer ...)
  .saveAsObjectFile(...)
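
For reference, a minimal runnable sketch of that pipeline (the extract function,
the paths, and the (key, count) value type are placeholders for my actual
extraction logic and domain classes):

import org.apache.spark.{SparkConf, SparkContext}

object XmlExtractJob {
  // Placeholder for the real extraction logic; the actual domain class is
  // simplified here to a (key, count) pair. Most files return an empty Seq.
  def extract(bytes: Array[Byte]): Seq[(String, Long)] = Seq.empty

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xml-extract"))

    sc.binaryFiles("hdfs:///data/xml/*")            // ~1 million small XML files
      .flatMap { case (_, stream) => extract(stream.toArray()) }
      .reduceByKey(_ + _)                           // placeholder reducer
      .saveAsObjectFile("hdfs:///data/out")

    sc.stop()
  }
}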

Initially I had it with groupBy, but that method uses a lot of resources
(according to the API docs). Switching to reduceByKey didn't make any
difference, as sketched below.
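
For completeness, the change was roughly the following (assuming the
(key, count) pairs from the sketch above; my real value type differs):

pairs.groupByKey().mapValues(_.sum)   // old: buffers all values per key on one executor
pairs.reduceByKey(_ + _)              // new: combines values map-side before the shuffle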

It seems Spark runs two cycles of computation over ~270k items. In the first
cycle, around 15 GB of memory is used and is not reclaimed by GC; that is true
for both the driver and the container. In the second cycle, it keeps allocating
memory until it runs out and YARN kills it.

Any ideas?

Thanks


