I'm running a Python script with spark-submit on top of YARN on an EMR cluster with 30 nodes. The script reads approximately 3.9 TB of data from S3, applies some transformations and filtering, and then computes some aggregate counts. Stage 2 of the job appears to complete cleanly, with no executor failures or resubmissions, but when Stage 3 starts, many Stage 2 tasks have to be rerun due to FetchFailure errors. In fact, I usually see at least 3-4 retries of Stage 2 before Stage 3 can successfully start. The application eventually completes, but the retries add roughly an hour or more of overhead.
I'm trying to determine what causes the FetchFailure exceptions, since anything computed in the job that cannot fit in the available memory cache should, by default, be spilled to disk for later retrieval. I also see some "java.net.ConnectException: Connection refused" and "java.io.IOException: sendMessageReliably failed without being ACK'd" errors in the logs, following a CancelledKeyException and a ClosedChannelException, but I have no idea why the nodes in the EMR cluster would suddenly stop being able to communicate. If anyone has ideas as to why the Stage 2 data needs to be recomputed several times in this job, please let me know; I am fairly bewildered by this behavior.
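For context, here is the shape of spark-submit invocation I could try next, with the shuffle-robustness settings that are commonly suggested for FetchFailure problems. The script name and values are placeholders, not what I'm currently running:

```shell
# Hypothetical invocation; my_job.py and the specific values are placeholders.
# spark.network.timeout (default 120s): tolerate longer executor GC pauses
#   before peers treat the connection as dead.
# spark.shuffle.io.maxRetries / spark.shuffle.io.retryWait (defaults 3 / 5s):
#   retry shuffle block fetches longer before raising a FetchFailure and
#   recomputing the map stage.
# spark.executor.memoryOverhead: extra off-heap headroom so YARN does not
#   kill executors mid-shuffle for exceeding their container limit.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.network.timeout=600s \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  --conf spark.executor.memoryOverhead=4g \
  my_job.py
```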