Hello everyone, I am trying to compute the pairwise similarities among 550k objects using the DIMSUM algorithm available in Spark 1.6.
The cluster runs on AWS Elastic MapReduce and consists of six r3.2xlarge instances (one master and five core nodes), each with 8 vCPUs and 61 GiB of RAM. My input data is a 3.5 GB CSV file hosted on AWS S3, which I use to build a RowMatrix with 550k columns and 550k rows, passing sparse vectors as rows to the RowMatrix constructor.

Every attempt I've made so far fails during the mapPartitionsWithIndex stage of the RowMatrix.columnSimilarities() method (source code at https://github.com/apache/spark/blob/v1.6.0/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L587), with YARN containers either 1) exiting with FAILURE due to an OutOfMemory exception on the Java heap space (apparently thrown from within Spark), or 2) being terminated by the ApplicationMaster (and increasing spark.yarn.executor.memoryOverhead as suggested doesn't seem to help).

I have tried and combined different approaches without noticing significant improvements:
- setting the AWS EMR maximizeResourceAllocation option to true (details at https://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html)
- increasing the number of partitions (via spark.default.parallelism, up to 8000)
- increasing the driver and executor memory (from the defaults of ~512 MB and ~5 GB to ~50 GB and ~15 GB, respectively)
- increasing the YARN memory overhead (from the default 10% up to 40% of the driver and executor memory, respectively)
- setting the DIMSUM threshold to 0.5 and 0.8 to reduce the number of comparisons

Does anyone have an idea about the possible cause(s) of these errors? Thank you.
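For context, the core of the job follows the standard MLlib pattern below. This is a simplified sketch, not my exact code: the S3 path is a placeholder, and parseIndicesAndValues stands in for the real CSV-parsing logic, which is omitted.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: build a RowMatrix of sparse rows and run DIMSUM on its columns.
val sc = new SparkContext(new SparkConf().setAppName("dimsum-550k"))

// Each CSV line becomes a sparse 550k-dimensional vector.
// parseIndicesAndValues is a placeholder for the actual parsing logic.
val rows = sc.textFile("s3://my-bucket/input.csv") // placeholder path
  .map { line =>
    val (indices, values) = parseIndicesAndValues(line) // placeholder
    Vectors.sparse(550000, indices, values)
  }

val mat = new RowMatrix(rows)

// With a threshold > 0, DIMSUM samples column pairs: higher thresholds
// prune more comparisons, at the cost of only estimating similarities
// reliably above the threshold.
val similarities = mat.columnSimilarities(0.5)
```

The threshold argument is what I varied between 0.5 and 0.8 in the experiments listed above.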
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DIMSUM-among-550k-objects-on-AWS-Elastic-Map-Reduce-fails-with-OOM-errors-tp27038.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.