Just a couple of points to add:

1. "partition" is more of a logical construct so partitions can not fail. A
task which is reading from persistent storage to RDD can fail, and thus can
be rerun to reprocess the partition. What is Ranadip mentioned above is
true, with a caveat that data will be actually be read in memory only after
an action is encountered, everything prior to that is logical plan, and
because of Spark's lazy loading data wont be materialized until it really
requires it

2. Between storage and tasks, there is a concept called splits. Each task
actually runs on a split, and there are various ways to define the splits. In
practice, splits and partitions are often mapped 1:1, but I think various
strategies can be defined to map splits to partitions.
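
To make both points concrete, here is a minimal sketch in Scala. The input
path and app name are hypothetical; the point is that the transformations
only build a logical plan, the partition count comes from how the file is
split, and nothing is read until the action at the end runs:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("lazy-demo").getOrCreate()
    val sc = spark.sparkContext

    // Transformations only record lineage in the logical plan; no data is
    // read yet.
    val upper = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
      .filter(_.nonEmpty)
      .map(_.toUpperCase)

    // The partition count is derived from the input splits of the file, not
    // from the data itself.
    println(s"partitions = ${upper.getNumPartitions}")

    // Only this action materializes the data; tasks now read their splits,
    // and a failed task is simply rerun against the same split.
    println(s"non-empty lines = ${upper.count()}")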

HTH.....

On Mon, Jan 24, 2022 at 7:39 AM Ranadip Chatterjee <ranadi...@gmail.com>
wrote:

> Interesting question! I think this goes back to the roots of Spark. You
> ask "But suppose I am reading a file that is distributed across nodes in
> partitions. So, what will happen if a partition that holds some data
> fails?". Assuming you mean that the distributed file system holding the
> file suffers a failure on the node hosting the said partition, here's
> something that might explain this better:
>
> Spark does not come with its own persistent distributed file storage
> system, so it relies on the underlying storage system to provide
> resilience against input file failure. Commonly, an open source stack
> will have HDFS (Hadoop Distributed File System) as the persistent
> distributed file system that Spark reads from and writes back to. In that
> case, your scenario most likely means a failure of the HDFS datanode that
> was hosting the partition of data in question.
>
> Now, if the failure happens after Spark has completely read that
> partition and is in the middle of the job, Spark will progress with the
> job unhindered, because it does not need to go back and re-read the data.
> Remember, Spark keeps the intermediate state of the data between stages
> (as RDDs) and retains the piece of the DAG from one RDD to the next. So
> Spark can recover from its own node failures in the intermediate stages
> by simply rerunning that part of the DAG from the last available RDD to
> recover the next stage. The only case where Spark will need to reach out
> to HDFS is if the very first Spark stage encounters a failure before it
> has created the RDD for that partition.
>
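
A quick note on the above: the recovery Ranadip describes is driven by the
lineage Spark records for each RDD. Here is a minimal sketch (the input path
is hypothetical; sc is the active SparkContext, as predefined in spark-shell)
that prints the lineage Spark would replay to recompute a lost partition:

    // Each transformation extends the lineage; toDebugString shows the
    // chain of RDDs Spark walks back through to recompute a lost partition.
    val rdd = sc.textFile("hdfs:///data/input.txt")   // hypothetical path
      .map(_.toUpperCase)
      .filter(_.nonEmpty)
    println(rdd.toDebugString)
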
> In that specific edge case, Spark will reach out to HDFS to request the
> failed HDFS block. At that point, if HDFS detects that the datanode
> hosting that block is not responding, it will transparently redirect
> Spark to another replica of the same block. So the job will progress
> unhindered in this case (perhaps a tad slower, as the read may no longer
> be node-local). Only in extreme scenarios (a catastrophic HDFS failure,
> all replicas being offline at that exact moment, or a file saved with
> only one replica) will the Spark job fail, as there is no way for it to
> recover the partition in question. (In my rather long time working with
> Spark, I have never come across this scenario.)
>
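
To see the replicas HDFS can fall back to, you can inspect a file's block
locations through the Hadoop FileSystem API. A rough sketch (the path is
hypothetical; it reuses the Hadoop configuration from the active SparkContext
sc):

    import org.apache.hadoop.fs.Path

    val path   = new Path("hdfs:///data/input.txt")   // hypothetical path
    val fs     = path.getFileSystem(sc.hadoopConfiguration)
    val status = fs.getFileStatus(path)

    // Each block lists every datanode holding a replica; if one host is
    // down, the read is transparently served from another host in the list.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { block =>
      println(s"offset=${block.getOffset} hosts=${block.getHosts.mkString(", ")}")
    }
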
> Other distributed storage systems behave similarly; for example, Google
> Cloud Storage and Amazon S3 have slightly different nuances but will act
> much like HDFS in this scenario.
>
> So, for all practical purposes, it is safe to say Spark will progress the
> job to completion in nearly all cases.
>
> Regards,
> Ranadip Chatterjee
>
>
> On Fri, 21 Jan 2022 at 20:40, Sean Owen <sro...@gmail.com> wrote:
>
>> Probably, because Spark prefers locality, but not necessarily.
>>
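
Small addition to Sean's point: how long the scheduler waits for a
node-local slot before falling back to a less local one is controlled by
spark.locality.wait. A minimal sketch of setting it (the app name is
hypothetical; 3s is the documented default):

    import org.apache.spark.sql.SparkSession

    // With a longer wait, Spark holds out longer for a node-local retry; at
    // "0s" it will schedule the retry on any available node straight away.
    val spark = SparkSession.builder()
      .appName("locality-demo")               // hypothetical app name
      .config("spark.locality.wait", "3s")    // 3s is the documented default
      .getOrCreate()
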
>> On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
>> kalgaonkarsiddh...@gmail.com> wrote:
>>
>>> Thank you so much for this information, Sean. One more question: when
>>> it wants to re-run the failed partition, where does it run? On the same
>>> node or some other node?
>>>
>>>
>>> On Fri, 21 Jan 2022, 23:41 Sean Owen, <sro...@gmail.com> wrote:
>>>
>>>> The Spark program already knows the partitions of the data and where
>>>> they exist; that's just defined by the data layout. It doesn't care
>>>> what data is inside. It knows partition 1 needs to be processed, and
>>>> if the task processing it fails, it needs to be run again. I'm not
>>>> sure where you're seeing data loss here? The data is already stored to
>>>> begin with, not somehow consumed and deleted.
>>>>
>>>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
>>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>>
>>>>> Okay, so suppose I have 10 records distributed across 5 nodes, and
>>>>> the partition on the first node holding 2 records failed. I
>>>>> understand that it will re-process this partition, but how will it
>>>>> come to know that the XYZ partition was holding the XYZ data, so that
>>>>> it will pick up only those records again and reprocess them? In case
>>>>> of failure of a partition, is there data loss, or is the data stored
>>>>> somewhere?
>>>>>
>>>>> Maybe my question is very naive, but I am trying to understand it in
>>>>> a better way.
>>>>>
>>>>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>
>>>>>> In that case, the file exists in parts across machines. No, tasks
>>>>>> won't re-read the whole file; no task does or can do that. Failed
>>>>>> partitions are reprocessed, but as in the first pass, the same
>>>>>> partition is processed.
>>>>>>
>>>>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
>>>>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello team,
>>>>>>>
>>>>>>> I am aware that when a task fails (for example, due to memory
>>>>>>> issues), it will be retried 4 times, since that is the default, and
>>>>>>> if it still fails, the entire job will fail.
>>>>>>>
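
On the "retried 4 times" point above: that number comes from the
spark.task.maxFailures setting, which defaults to 4. A minimal sketch of
setting it explicitly (the app name is hypothetical):

    import org.apache.spark.sql.SparkSession

    // spark.task.maxFailures is the number of failures of any particular
    // task tolerated before the whole job is failed; 4 is the default.
    val spark = SparkSession.builder()
      .appName("retry-demo")                   // hypothetical app name
      .config("spark.task.maxFailures", "4")
      .getOrCreate()
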
>>>>>>> But suppose I am reading a file that is distributed across nodes
>>>>>>> in partitions. So, what will happen if a partition that holds some
>>>>>>> data fails? Will it re-read the entire file and get that specific
>>>>>>> subset of data, since the driver has the complete information? Or
>>>>>>> will it copy the data to the other working nodes or tasks and try
>>>>>>> to run it?
>>>>>>>
>>>>>>

-- 
Best Regards,
Ayan Guha
