The Spark program already knows the partitions of the data and where they
exist; that's defined by the data layout. It doesn't care what data is
inside. It knows that partition 1 needs to be processed, and that if the
task processing it fails, it needs to be run again. I'm not sure where
you're seeing data loss here; the data is already stored to begin with, not
somehow consumed and deleted.
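
To make that concrete, here is a minimal sketch (the input path, app name,
and local master are just placeholders) showing that the partition layout is
derived from the data source itself, so a retried task simply reads the same
partition again:

import org.apache.spark.sql.SparkSession

object PartitionLayoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-layout-sketch")
      .master("local[*]")
      .getOrCreate()

    // The partitions of this RDD are defined by the input's block layout,
    // not by anything the job consumes or deletes.
    val rdd = spark.sparkContext.textFile("hdfs:///tmp/input.txt")

    // Each partition has a stable index; a failed task for partition 1 is
    // simply rescheduled against that same index and reads the same blocks.
    println(s"Number of partitions: ${rdd.getNumPartitions}")
    rdd.partitions.foreach(p => println(s"partition ${p.index}"))

    spark.stop()
  }
}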

On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
kalgaonkarsiddh...@gmail.com> wrote:

> Okay, so suppose I have 10 records distributed across 5 nodes, and the
> partition on the first node holding 2 records failed. I understand that
> Spark will re-process this partition, but how will it know which
> partition was holding which data, so that it picks up only those records
> and reprocesses them? In case of a partition failure, is there any data
> loss, or is the data stored somewhere?
>
> Maybe my question is very naive, but I am trying to understand this
> better.
>
> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>
>> In that case, the file exists in parts across machines. No, tasks won't
>> re-read the whole file; no task does or can do that. Failed partitions
>> are reprocessed, and, just as in the first pass, each retried task
>> processes the same partition it was originally assigned.
>>
>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
>> kalgaonkarsiddh...@gmail.com> wrote:
>>
>>> Hello team,
>>>
>>> I am aware that when a task fails (for example, because of memory
>>> issues), it will be retried 4 times by default, and if it still fails,
>>> the entire job fails.
>>>
>>> But suppose I am reading a file that is distributed across nodes in
>>> partitions. What will happen if a partition that holds some data fails?
>>> Will Spark re-read the entire file and pick out that specific subset of
>>> data, since the driver has the complete information? Or will it copy
>>> the data to the other working nodes or tasks and try to run it there?
>>>
>>
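
For reference, the retry count mentioned in the original question is the
spark.task.maxFailures setting, which defaults to 4. Below is a minimal
sketch of setting it explicitly when building the session; the app name and
the value of 8 are illustrative only:

import org.apache.spark.sql.SparkSession

object TaskRetrySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("task-retry-sketch")
      .master("local[*]")
      .config("spark.task.maxFailures", "8")
      .getOrCreate()

    // Print the effective limit; once the same task has failed this many
    // consecutive attempts, the stage (and hence the job) is aborted.
    println(spark.sparkContext.getConf.get("spark.task.maxFailures"))

    spark.stop()
  }
}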
