I've been experimenting with MLlib's BlockMatrix for distributed matrix multiplication, but I consistently run into problems with executors being killed due to memory constraints. The linked gist (<https://gist.github.com/thomas9t/3cf914e4d4609df35ee60ce08f36421b>) has a short example of multiplying a 25,000 x 25,000 square matrix, which takes approximately 5GB on disk, by a vector (also stored as a BlockMatrix). I am running this on a 3-node cluster (1 master + 2 workers) on Amazon EMR using the m4.xlarge instance type. Each instance has 16GB of RAM and 4 vCPUs. The gist has detailed information about the Spark environment.
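As a rough sanity check on those sizes, here is a minimal sketch of the arithmetic. It assumes dense double-precision storage (8 bytes per entry) and a 1024 x 1024 block size; both are assumptions for illustration, and the gist has the actual settings. The helper names are hypothetical, and the code does not touch Spark at all.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: entries are stored as dense 64-bit floats


def dense_matrix_bytes(rows, cols):
    """Raw size of a dense rows x cols matrix of doubles, ignoring JVM overhead."""
    return rows * cols * BYTES_PER_DOUBLE


def block_grid(rows, cols, rows_per_block, cols_per_block):
    """Number of block rows and block columns needed to cover the matrix."""
    return (math.ceil(rows / rows_per_block), math.ceil(cols / cols_per_block))


n = 25_000
block = 1024  # assumption: 1024 x 1024 blocks; the real job may use another size

matrix_bytes = dense_matrix_bytes(n, n)          # the 25,000 x 25,000 input
vector_bytes = dense_matrix_bytes(n, 1)          # the 25,000 x 1 vector / result
grid = block_grid(n, n, block, block)            # layout of the block grid
num_blocks = grid[0] * grid[1]
full_block_bytes = dense_matrix_bytes(block, block)

print(f"matrix:     {matrix_bytes / 1e9:.2f} GB")
print(f"vector:     {vector_bytes / 1e3:.0f} KB")
print(f"block grid: {grid[0]} x {grid[1]} = {num_blocks} blocks")
print(f"full block: {full_block_bytes / 1e6:.2f} MB")
```

Under these assumptions the raw matrix data comes to about 5GB, which is consistent with the on-disk size, while the vector and the result are each only about 200KB. These figures cover the raw values only and say nothing about serialization, shuffle, or JVM object overhead.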
I have tried reducing the block size of the matrix, increasing the number of partitions in the underlying RDD, increasing defaultParallelism, and increasing spark.yarn.executor.memoryOverhead (up to 3GB), all without success. The input matrix should fit comfortably in distributed memory, and the resulting matrix should be quite small (25,000 x 1), so I'm confused as to why Spark seems to want so much memory for this operation, and why it isn't spilling to disk if it needs more. The job does eventually complete successfully, but for larger matrices stages have to be repeated several times, which leads to long run times. I don't encounter any issues if I reduce the matrix size to about 3GB. Can anyone with experience using MLlib's matrix operators suggest which settings to look at, or explain what the hard memory constraints for BlockMatrix multiplication are?

Thanks,
Anthony