Hello everyone, I am trying to compute the pairwise similarities among 550k objects using the DIMSUM algorithm available in Spark 1.6.
The cluster runs on AWS Elastic MapReduce and consists of six r3.2xlarge instances (one master and five core nodes), each with 8 vCPUs and 61 GiB of RAM. My input data is a 3.5 GB CSV file hosted on AWS S3, which I use to build a RowMatrix with 550k columns and 550k rows, passing sparse vectors as rows to the RowMatrix constructor.

Every attempt I've made so far fails during the mapPartitionsWithIndex stage of the RowMatrix.columnSimilarities() method (source code at https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L587), with YARN containers either 1) exiting with FAILURE due to an OutOfMemory exception on the Java heap space (apparently thrown from within Spark), or 2) being terminated by the ApplicationMaster (and increasing spark.yarn.executor.memoryOverhead as suggested doesn't seem to help).

I have tried and combined different approaches without noticing significant improvements:
- setting the AWS EMR maximizeResourceAllocation option to true (details at https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html)
- increasing the number of partitions (via spark.default.parallelism, up to 8000)
- increasing the driver and executor memory (from the defaults of ~512 MB and ~5 GB to ~50 GB and ~15 GB, respectively)
- increasing the YARN memory overhead (from the default 10% up to 40% of the driver and executor memory, respectively)
- setting the DIMSUM threshold to 0.5 and 0.8 to reduce the number of comparisons

Does anyone have an idea about the possible cause(s) of these errors? Thank you.
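For context, the core of the job follows the standard MLlib pattern below. This is a simplified sketch, not my exact code: the S3 path is a placeholder, and parseIndicesAndValues stands in for the real CSV-parsing logic, which is omitted.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: build a RowMatrix of sparse rows and run DIMSUM on its columns.
val sc = new SparkContext(new SparkConf().setAppName("dimsum-550k"))

// Each CSV line becomes a sparse 550k-dimensional vector.
// parseIndicesAndValues is a placeholder for the actual parsing logic.
val rows = sc.textFile("s3://my-bucket/input.csv") // placeholder path
  .map { line =>
    val (indices, values) = parseIndicesAndValues(line) // placeholder
    Vectors.sparse(550000, indices, values)
  }

val mat = new RowMatrix(rows)

// With a threshold > 0, DIMSUM samples column pairs: higher thresholds
// prune more comparisons, at the cost of only estimating similarities
// reliably above the threshold.
val similarities = mat.columnSimilarities(0.5)
```

The threshold argument is what I varied between 0.5 and 0.8 in the experiments listed above.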
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DIMSUM-among-550k-objects-on-AWS-Elastic-Map-Reduce-fails-with-OOM-errors-tp27038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.