Hi Brett,

Are you noticing executors dying?  Can you check the YARN NodeManager logs
to see whether YARN is killing them for exceeding memory limits?
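
For reference, a rough sketch of one way to scan those logs for the kill
message (the log path below is a guess and will vary with your EMR/Hadoop
setup, so treat this only as a starting point):

    # Rough sketch: scan NodeManager logs for the message YARN writes when
    # it kills a container for running beyond its memory limit.
    # The glob below is a guess for an EMR layout; adjust to your cluster.
    import glob
    import re

    # YARN typically logs something like "... is running beyond physical
    # memory limits ... Killing container." before killing the executor.
    kill_pattern = re.compile(r"running beyond (physical|virtual) memory limits")

    for path in glob.glob("/var/log/hadoop-yarn/*nodemanager*.log*"):
        with open(path) as log:
            for line in log:
                if kill_pattern.search(line):
                    print(path + ": " + line.strip())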

-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer <brett.me...@crowdstrike.com>
wrote:

> I’m running a Python script with spark-submit on top of YARN on an EMR
> cluster with 30 nodes.  The script reads approximately 3.9 TB of data from
> S3, then does some transformations and filtering, followed by some
> aggregate counts.  During Stage 2 of the job, everything appears to
> complete fine with no executor failures or resubmissions, but when Stage 3
> starts, many Stage 2 tasks have to be rerun due to FetchFailure errors.
> In fact, I usually see at least 3-4 retries of Stage 2 before Stage 3 can
> successfully start.  The whole application eventually completes, but the
> retries add about an hour or more of overhead.
>
> I’m trying to determine why there were FetchFailure exceptions, since
> anything computed in the job that could not fit in the available memory
> cache should, by default, be spilled to disk for later retrieval.  I also
> see some "java.net.ConnectException: Connection refused" and
> "java.io.IOException: sendMessageReliably failed without being ACK'd"
> errors in the logs after a CancelledKeyException followed by a
> ClosedChannelException, but I have no idea why the nodes in the EMR
> cluster would suddenly stop being able to communicate.
>
> If anyone has ideas about why the data needs to be rerun several times in
> this job, please let me know, as I am fairly bewildered by this behavior.
>
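
For anyone reading along, here is a minimal PySpark sketch of the kind of
pipeline described above (the S3 path, parsing, and filter logic are made
up for illustration, and the explicit MEMORY_AND_DISK persistence just
makes the spill-to-disk intent visible):

    # Hypothetical sketch of the kind of job described above.  The bucket,
    # parsing, and filter logic are placeholders, not the actual script.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="s3-aggregate-counts")

    # Read the raw records from S3 (placeholder path).
    raw = sc.textFile("s3n://example-bucket/events/2014/11/*")

    # Transform and filter, then persist with a disk fallback so partitions
    # evicted from memory are spilled rather than recomputed.
    parsed = (raw
              .map(lambda line: line.split("\t"))
              .filter(lambda fields: len(fields) > 3 and fields[2] == "GET"))
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    # Aggregate counts per key.  The shuffle here is where FetchFailure
    # retries surface if the earlier stage's map outputs are lost (e.g.
    # because the executor that wrote them died).
    counts = (parsed
              .map(lambda fields: (fields[0], 1))
              .reduceByKey(lambda a, b: a + b))

    for key, count in counts.take(20):
        print(key + "\t" + str(count))

    sc.stop()

Note that shuffle map outputs are always written to the local disks of the
executors that produced them, so a FetchFailure usually means the executor
(or node) holding those outputs went away or stopped responding, rather
than that the data didn't fit in memory.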
