Hi Brett,

Are you noticing executors dying? Can you check the YARN NodeManager logs to see whether YARN is killing them for exceeding memory limits?
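If YARN does turn out to be killing containers, the usual first step is to leave more off-heap headroom by raising the executor memory overhead. A minimal sketch of the conf-side settings, assuming yarn-client mode (the sizes are placeholders to tune for your cluster, not recommendations):

    from pyspark import SparkConf, SparkContext

    # YARN kills a container when its total footprint exceeds
    # spark.executor.memory plus the overhead allowance, so raising the
    # overhead gives off-heap and non-JVM memory (e.g., the Python worker
    # processes) more room. Both values below are placeholders.
    conf = (SparkConf()
            .set("spark.executor.memory", "8g")
            .set("spark.yarn.executor.memoryOverhead", "1024"))  # in MB

    sc = SparkContext(conf=conf)

The same settings can also be passed on the spark-submit command line with --conf instead of being set in the script.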
-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer <brett.me...@crowdstrike.com> wrote:

> I'm running a Python script with spark-submit on top of YARN on an EMR
> cluster with 30 nodes. The script reads approximately 3.9 TB of data from
> S3, then does some transformations and filtering, followed by some
> aggregate counts. During Stage 2 of the job, everything appears to
> complete fine, with no executor failures or resubmissions, but when Stage
> 3 starts up, many Stage 2 tasks have to be rerun due to FetchFailure
> errors. In fact, I usually see at least 3-4 retries of Stage 2 before
> Stage 3 can start successfully. The whole application eventually
> completes, but the retries add an extra hour or more of overhead.
>
> I'm trying to determine why there were FetchFailure exceptions, since
> anything computed in the job that cannot fit in the available memory
> cache should by default be spilled to disk for later retrieval. I also
> see some "java.net.ConnectException: Connection refused" and
> "java.io.IOException: sendMessageReliably failed without being ACK'd"
> errors in the logs after a CancelledKeyException followed by a
> ClosedChannelException, but I have no idea why the nodes in the EMR
> cluster would suddenly stop being able to communicate.
>
> If anyone has ideas as to why the data needs to be rerun several times in
> this job, please let me know, as I am fairly bewildered by this behavior.
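One more note on the spill point: spilling protects against data not fitting in memory, but shuffle map output lives on each executor's local disk and is served by that executor, so if YARN kills the container the output is lost and the upstream stage has to be rerun. That would match FetchFailures appearing only once Stage 3 starts fetching Stage 2's output. For reference, a rough sketch of the shape of job I understand you to be running, with an explicit persist so partitions that don't fit in memory go to local disk rather than being recomputed (the S3 path, filter, and key function are placeholders, not your actual logic):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="s3-aggregate-counts")

    # Placeholder input path -- stands in for the real ~3.9 TB S3 dataset.
    raw = sc.textFile("s3n://your-bucket/input/")

    # Placeholder transformation and filter.
    filtered = raw.map(lambda line: line.strip()) \
                  .filter(lambda line: len(line) > 0)

    # Cached partitions that don't fit in memory spill to local disk
    # instead of being evicted and recomputed on retry.
    filtered.persist(StorageLevel.MEMORY_AND_DISK)

    # Aggregate counts keyed on a placeholder field.
    counts = filtered.map(lambda line: (line.split("\t")[0], 1)) \
                     .reduceByKey(lambda a, b: a + b)

    print(counts.count())

    sc.stop()

Note that persisting this way still doesn't protect against losing the executor itself, which is why the NodeManager logs are the first place to look.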