Interesting question! I think this goes back to the roots of Spark. You ask
"But suppose if I am reading a file that is distributed across nodes in
partitions. So, what will happen if a partition fails that holds some
data?". Assuming you mean that the distributed file system holding the file
suffers a failure on the node hosting that partition - here's something
that might explain this better:

Spark does not come with its own persistent distributed file storage
system, so it relies on the underlying storage system to provide resilience
against input file failures. In a typical open source stack, HDFS (Hadoop
Distributed File System) is the persistent distributed file system that
Spark reads from and writes back to. In that setup, your scenario would
most likely mean a failure of the HDFS datanode that was hosting the
partition of data in question.
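
To make that concrete, here is a minimal sketch in Scala (the HDFS path is
hypothetical, substitute your own). Each input partition of the RDD
typically corresponds to one HDFS block of the file, so a datanode failure
only affects the blocks that node was hosting:

    // Minimal sketch - assumes a hypothetical HDFS path.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("hdfs-read-example")
      .getOrCreate()

    // Each input partition usually maps to one HDFS block of the file.
    val lines = spark.sparkContext.textFile("hdfs:///data/events/input.txt")
    println(s"Number of input partitions: ${lines.getNumPartitions}")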

Now, if the failure happens after Spark has completely read that partition
and is in the middle of the job, Spark will progress with the job
unhindered because it does not need to go back and re-read the data.
Remember, Spark keeps the shuffle output between stages and retains the
lineage (the DAG of transformations) from one RDD to the next. So Spark can
recover from its own node failures in the intermediate stages by simply
re-running that lineage from the last available stage output. The only case
where Spark needs to reach back out to HDFS is if the very first stage
encounters a failure before it has produced the RDD for that partition (or
if the recomputation has to cascade all the way back to the source because
the intermediate output was lost as well).
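
You can see the lineage Spark keeps for recomputation with toDebugString.
A minimal sketch, continuing from the session and hypothetical path above -
lost partitions in later stages are recomputed by re-running this chain
rather than re-reading the whole source:

    // spark is the SparkSession from the sketch above.
    val counts = spark.sparkContext
      .textFile("hdfs:///data/events/input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Prints the chain of transformations (the lineage / DAG) for this RDD.
    println(counts.toDebugString)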

In that specific edge case, Spark will reach out to HDFS to request the
failed HDFS block. If HDFS detects that the datanode hosting that block is
not responding, it will transparently redirect Spark to another replica of
the same block, so the job will still progress unhindered (perhaps a tad
slower, as the read may no longer be node-local). Only in the extreme
scenarios - HDFS suffers a catastrophic failure, all replicas happen to be
offline at that exact moment, or the file was written with only one
replica - will the Spark job fail, because there is no way left for it to
recover those partitions. (In my rather long time working with Spark, I
have never come across this scenario.)
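
If you want the data Spark itself writes back to carry the usual HDFS
redundancy, you can set the standard HDFS client property dfs.replication
on the job's Hadoop configuration. A minimal sketch, continuing from the
sketches above (the output path is hypothetical; 3 replicas is the common
HDFS default, and writing with only 1 replica is what creates the
unrecoverable case described here):

    // Ask HDFS to keep 3 replicas of the blocks this job writes.
    spark.sparkContext.hadoopConfiguration.set("dfs.replication", "3")

    // counts is the RDD from the sketch above.
    counts.saveAsTextFile("hdfs:///data/output/word-counts")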

Other distributed storage systems behave much the same way - Google Cloud
Storage or Amazon S3 have slightly different nuances, but in this scenario
the outcome is essentially the same as with HDFS.

So, for all practical purposes, it is safe to say the Spark job will
progress to completion.

Regards,
Ranadip Chatterjee


On Fri, 21 Jan 2022 at 20:40, Sean Owen <sro...@gmail.com> wrote:

> Probably, because Spark prefers locality, but not necessarily.
>
> On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
> kalgaonkarsiddh...@gmail.com> wrote:
>
>> Thank you so much for this information, Sean. One more question, that
>> when it wants to re-run the failed partition, where does it run? On the
>> same node or some other node?
>>
>>
>> On Fri, 21 Jan 2022, 23:41 Sean Owen, <sro...@gmail.com> wrote:
>>
>>> The Spark program already knows the partitions of the data and where
>>> they exist; that's just defined by the data layout. It doesn't care what
>>> data is inside. It knows partition 1 needs to be processed and if the task
>>> processing it fails, needs to be run again. I'm not sure where you're
>>> seeing data loss here? the data is already stored to begin with, not
>>> somehow consumed and deleted.
>>>
>>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>
>>>> Okay, so suppose I have 10 records distributed across 5 nodes and the
>>>> partition of the first node holding 2 records failed. I understand that it
>>>> will re-process this partition but how will it come to know that XYZ
>>>> partition was holding XYZ data so that it will pick again only those
>>>> records and reprocess it? In case of failure of a partition, is there a
>>>> data loss? or is it stored somewhere?
>>>>
>>>> Maybe my question is very naive but I am trying to understand it in a
>>>> better way.
>>>>
>>>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> In that case, the file exists in parts across machines. No, tasks
>>>>> won't re-read the whole file; no task does or can do that. Failed
>>>>> partitions are reprocessed, but as in the first pass, the same partition 
>>>>> is
>>>>> processed.
>>>>>
>>>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
>>>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>>>
>>>>>> Hello team,
>>>>>>
>>>>>> I am aware that in case of memory issues when a task fails, it will
>>>>>> try to restart 4 times since it is a default number and if it still fails
>>>>>> then it will cause the entire job to fail.
>>>>>>
>>>>>> But suppose if I am reading a file that is distributed across nodes
>>>>>> in partitions. So, what will happen if a partition fails that holds some
>>>>>> data? Will it re-read the entire file and get that specific subset of 
>>>>>> data
>>>>>> since the driver has the complete information? or will it copy the data 
>>>>>> to
>>>>>> the other working nodes or tasks and try to run it?
>>>>>>
>>>>>
