Hi,

I am running a Spark Streaming job and was testing its fault tolerance by killing one of the workers with kill -9.
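For context, here is a minimal sketch of the kind of streaming job I am testing with. It is illustrative only: the master URL, checkpoint directory, and word-count logic below are placeholders, not my actual job.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFaultToleranceTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("StreamingFaultToleranceTest")
      .setMaster("spark://master:7077")  // placeholder master URL

    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // placeholder checkpoint dir

    // A simple word count over a socket stream. The reduceByKey step
    // introduces a shuffle, and the shuffle fetch is where the
    // FetchFailedException below shows up when the worker holding the
    // map output is killed.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}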
My understanding is that when a worker is killed, the job should not die; the lost tasks should be rescheduled on the remaining workers and execution should resume. Instead, I get the following error and the job halts:

org.apache.spark.shuffle.FetchFailedException: Failed to connect to .....

When I restart the killed worker (two workers were running on the machine and I killed just one of them), execution resumes and the job runs to completion.

Please help me understand why my job is not tolerant to a worker failure. Am I missing something? Essentially, I need the job to keep running even if a worker is lost.

Regards,
Kundan