Probably on the same node, because Spark prefers data locality, but not
necessarily.
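
A minimal sketch, assuming a local SparkSession and illustrative names:
spark.locality.wait controls how long the scheduler waits for a
data-local executor slot before running a task, retries included, at a
worse locality level, which is why a retry usually lands on the same
node but is not guaranteed to.

import org.apache.spark.sql.SparkSession

// How long to wait for a node-local slot before falling back to
// rack-local and then to any node. Applies to first attempts and to
// retries alike.
val spark = SparkSession.builder()
  .appName("locality-demo")            // illustrative name
  .master("local[4]")
  .config("spark.locality.wait", "3s") // 3s is the default
  .getOrCreate()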

On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
kalgaonkarsiddh...@gmail.com> wrote:

> Thank you so much for this information, Sean. One more question: when it
> wants to re-run the failed partition, where does it run? On the same node
> or on some other node?
>
>
> On Fri, 21 Jan 2022, 23:41 Sean Owen, <sro...@gmail.com> wrote:
>
>> The Spark program already knows the partitions of the data and where they
>> exist; that's just defined by the data layout. It doesn't care what data
>> is inside. It knows that partition 1 needs to be processed and that, if
>> the task processing it fails, it needs to be run again. I'm not sure
>> where you're seeing data loss here: the data is already stored to begin
>> with, not somehow consumed and deleted.
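>>
>> A minimal sketch (local mode, made-up path) of what "already knows the
>> partitions" means: the partitioning is fixed by the file's layout, not
>> by what any task has read so far, so a retried task re-reads exactly
>> its own split of the source data.
>>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder().master("local[4]").getOrCreate()
>> // Hypothetical path; splits are determined by the file layout and are
>> // identical on every attempt.
>> val rdd = spark.sparkContext.textFile("/data/input.txt")
>> println(rdd.getNumPartitions)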
>>
>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
>> kalgaonkarsiddh...@gmail.com> wrote:
>>
>>> Okay, so suppose I have 10 records distributed across 5 nodes and the
>>> partition on the first node, holding 2 records, fails. I understand that
>>> Spark will re-process this partition, but how will it know which data
>>> that partition held, so that it picks up only those records and
>>> reprocesses them? When a partition fails, is there data loss, or is the
>>> data stored somewhere?
>>>
>>> Maybe my question is very naive, but I am trying to understand this
>>> better.
>>>
>>> On Fri, Jan 21, 2022 at 11:32 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> In that case, the file exists in parts across machines. No, tasks won't
>>>> re-read the whole file; no task does or can do that. Failed partitions
>>>> are reprocessed, but just as in the first pass, each task processes its
>>>> own partition.
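>>>>
>>>> A toy sketch (local mode, invented data) mirroring the
>>>> 10-records-across-5-partitions example: each task sees only its own
>>>> partition's records, and that same partition is what a retry
>>>> recomputes.
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>>
>>>> val spark = SparkSession.builder().master("local[2]").getOrCreate()
>>>> val rdd = spark.sparkContext.parallelize(1 to 10, 5)
>>>> rdd.mapPartitionsWithIndex { (i, it) =>
>>>>   // Each task receives just the records of partition i.
>>>>   Iterator(s"partition $i holds ${it.size} records")
>>>> }.collect().foreach(println)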
>>>>
>>>> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
>>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>>
>>>>> Hello team,
>>>>>
>>>>> I am aware that when a task fails (for example, due to memory issues),
>>>>> it will be retried up to 4 times, since that is the default, and if it
>>>>> still fails, the entire job fails.
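>>>>>
>>>>> (For reference, that default is the spark.task.maxFailures setting;
>>>>> a minimal sketch of setting it explicitly, in local mode:)
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>>
>>>>> val spark = SparkSession.builder()
>>>>>   .master("local[2]")
>>>>>   // 4 is the default: the job fails once any single task has failed
>>>>>   // this many times.
>>>>>   .config("spark.task.maxFailures", "4")
>>>>>   .getOrCreate()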
>>>>>
>>>>> But suppose I am reading a file that is distributed across nodes in
>>>>> partitions. What will happen if a partition that holds some data fails?
>>>>> Will Spark re-read the entire file and pick out that specific subset of
>>>>> the data, since the driver has the complete information? Or will it
>>>>> copy the data to other working nodes or tasks and try to run it there?
>>>>>
>>>>
