Hi all,

I am running a Spark Streaming job. It produced correct results for a
while, but then the job kept running without producing any results. I
checked the Spark Streaming UI and found that 4 tasks of one stage had
failed.

The error message was: "Job aborted due to stage failure: Task 0 in
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0
(TID 400048, ip-172-31-13-130.ec2.internal): ExecutorLostFailure (executor
lost)
Driver stacktrace:"

I clicked into the stage and found that the 4 executors running its tasks
all had the same error message:

ExecutorLostFailure (executor lost)


The stage that failed was actually runJob at ReceiverTracker.scala:275
<http://ec2-54-172-118-237.compute-1.amazonaws.com:9046/proxy/application_1415902783817_0019/stages/stage?id=2&attempt=0>,
which is the stage that keeps receiving messages from Kafka. I guess that is
why the job no longer produces any results.

To investigate, I logged into one of the executor machines and checked
the Hadoop log. The log file contains many occurrences of this exception:

*java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )*


This streaming job reads from Kafka and produces aggregation results.
After this stage failure, the job is still running, but the Spark UI shows
no data being shuffled.
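
For context, the job is structured roughly like the sketch below. This is
not the actual code: the topic name, ZooKeeper quorum, consumer group, and
checkpoint directory are placeholders, and the real aggregation logic is
more involved than this per-batch count.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils  // from the spark-streaming-kafka artifact

val conf = new SparkConf().setAppName("KafkaAggregation")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/checkpoint")  // placeholder checkpoint directory

// Receiver-based Kafka stream: ZooKeeper quorum, consumer group, topic -> receiver threads
val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
val values = messages.map(_._2)

// Simplified aggregation: count records per key in each batch
val counts = values.map(v => (v.split(",")(0), 1L)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()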

The amount of time the job runs correctly before failing varies from run
to run. Does anyone have an idea why this Spark Streaming job hit this
exception, and why it cannot recover from the stage failure?

Thanks!

Bill
