I've been experimenting with MLlib's BlockMatrix for distributed matrix
multiplication, but I consistently run into problems with executors being
killed due to memory constraints. The linked gist
(<https://gist.github.com/thomas9t/3cf914e4d4609df35ee60ce08f36421b>)
contains a short example of multiplying a 25,000 x 25,000 square matrix
(roughly 5GB on disk) with a vector (also stored as a BlockMatrix). I am
running this on a 3-node (1 master + 2 workers) cluster on Amazon EMR
using the m4.xlarge instance type; each instance has 16GB of RAM and 4
vCPUs. The gist has detailed information about the Spark environment.
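
For reference, the computation boils down to something like the sketch
below (simplified, with random data standing in for the on-disk matrix
the gist actually loads; variable names are mine, not the gist's):

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val spark = SparkSession.builder().appName("BlockMatrixRepro").getOrCreate()
val sc = spark.sparkContext

val n = 25000
val bs = 1000            // block size; I've also tried smaller values
val nBlocks = n / bs     // 25 block rows/columns

// Random dense blocks (each 1000 x 1000 block is ~8MB, ~5GB total).
val blocks = sc.parallelize(
  for (bi <- 0 until nBlocks; bj <- 0 until nBlocks) yield (bi, bj)
).map { case (bi, bj) =>
  ((bi, bj), DenseMatrix.rand(bs, bs, new java.util.Random(bi * nBlocks + bj)): Matrix)
}
val A = new BlockMatrix(blocks, bs, bs).cache()

// The vector, stored as a single-column BlockMatrix.
val vBlocks = sc.parallelize(0 until nBlocks).map { bi =>
  ((bi, 0), DenseMatrix.ones(bs, 1): Matrix)
}
val v = new BlockMatrix(vBlocks, bs, 1).cache()

val result = A.multiply(v)   // should be just 25,000 x 1
result.blocks.count()        // force execution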

I have tried reducing the block size of the matrix, increasing the number
of partitions in the underlying RDD, increasing defaultParallelism, and
increasing spark.yarn.executor.memoryOverhead (up to 3GB), all without
success. The input matrix should fit comfortably in distributed memory,
and the resulting matrix is quite small (25,000 x 1), so I'm confused
about why Spark seems to want so much memory for this operation, and why
it isn't spilling to disk when it needs more. The job does eventually
complete successfully, but for larger matrices stages have to be repeated
several times, which leads to long run times. I don't encounter any
issues if I reduce the matrix size to about 3GB. Can anyone with
experience using MLlib's matrix operators suggest which settings to look
at, or explain what the hard memory constraints on BlockMatrix
multiplication are?
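
For concreteness, the knobs I've been turning look roughly like this
(the exact values are illustrative; I've swept over several):

spark-submit \
  --master yarn \
  --num-executors 2 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=3072 \
  --conf spark.default.parallelism=32 \
  block-matrix-job.jar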

Thanks,

Anthony
