I am testing the performance of Spark to see how it behaves when the dataset size exceeds the amount of memory available. I am running wordcount on a 4-node cluster (Intel Xeon 16 cores (32 threads), 256GB RAM per node). I limited spark.executor.memory to 64g, so I have 256g of memory available in the cluster. Wordcount fails due to connection errors during saveAsTextFile() when the input size is 1TB. I have tried experimenting with different timeouts, and akka frame sizes but the job is still failing. Are there any changes that I should make to get the job to run successfully?
Here is my most recent config. SPARK_WORKER_CORES=32 SPARK_WORKER_MEMORY=64g SPARK_WORKER_INSTANCES=1 SPARK_DAEMON_MEMORY=1g SPARK_JAVA_OPTS="-Dspark.executor.memory=64g -Dspark.default.parallelism=128 -Dspark.deploy.spreadOut=true -Dspark.storage.memoryFraction=0.5 -Dspark.shuffle.consolidateFiles=true -Dspark.akka.frameSize=200 -Dspark.akka.timeout=300 -Dspark.storage.blockManagerSlaveTimeoutMs=300000" Error logs: 14/03/17 13:07:52 WARN ExternalAppendOnlyMap: Spilling in-memory map of 584 MB to disk (1 time so far) 14/03/17 13:07:52 WARN ExternalAppendOnlyMap: Spilling in-memory map of 510 MB to disk (1 time so far) 14/03/17 13:08:03 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(node02,56673) 14/03/17 13:08:03 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(node02,56673) 14/03/17 13:08:03 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(node02,56673) 14/03/17 13:08:03 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@2c762242 14/03/17 13:08:03 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@7fc331db 14/03/17 13:08:03 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@220eecfa 14/03/17 13:08:03 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@286cca3 14/03/17 13:08:03 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(node02,56673) 14/03/17 13:08:03 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(node02,56673) 14/03/17 13:08:03 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(node02,56673) 14/03/17 13:08:03 INFO ConnectionManager: Notifying org.apache.spark.network.ConnectionManager$MessageStatus@2d1d6b0a 14/03/17 13:08:03 ERROR BlockFetcherIterator$BasicBlockFetcherIterator: Could not get block(s) from ConnectionManagerId(node02,56673)