+cloud-dataproc-discuss

On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee <ranadi...@gmail.com>
wrote:

> To me, it seems like the data being processed on the two systems is not
> identical. I can't think of any other reason why the single-task stage
> would get a different number of input records in the two cases. 700 GB of
> input to a single task is not good and seems to be the bottleneck.
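>
> A quick way to verify this is to compare per-key record counts on both
> inputs. A minimal sketch in Scala; the DataFrame `df` and the column
> name "key" are placeholders for whatever the job actually shuffles on:
>
> import org.apache.spark.sql.functions.{count, desc}
>
> // Show the heaviest keys; one dominant key would explain a single
> // task receiving almost all of the shuffle data.
> df.groupBy("key")
>   .agg(count("*").as("records"))
>   .orderBy(desc("records"))
>   .show(20, truncate = false)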
>
> On Wed, 25 May 2022, 06:32 Ori Popowski, <ori....@gmail.com> wrote:
>
>> Hi,
>>
>> Both jobs use spark.dynamicAllocation.enabled, so there's no need to
>> change the number of executors; there are 702 executors in the Dataproc
>> cluster, so that is not the problem.
>> As for the number of partitions - I didn't change it, and it's still 400.
>> While writing this I realised that I have more partitions than executors,
>> but the same was true on EMR.
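>>
>> For reference, the settings in play look roughly like this. 702 and 400
>> are the values from this thread; the maxExecutors line is an assumption
>> for illustration, and whether the 400 partitions come from
>> spark.sql.shuffle.partitions or spark.default.parallelism depends on
>> which API the job uses, so treat this as a sketch:
>>
>> spark-submit \
>>   --conf spark.dynamicAllocation.enabled=true \
>>   --conf spark.dynamicAllocation.maxExecutors=702 \
>>   --conf spark.sql.shuffle.partitions=400 \
>>   ...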
>>
>> I am observing one task in the final stage on EMR as well. The difference
>> is that on EMR that task receives about 50K of data, while on Dataproc it
>> receives 700 GB. I don't understand why this is happening. It could mean
>> that the execution graph is different, but the job is exactly the same.
>> Could it be because the minor Spark version is different?
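>>
>> One way to check whether the graph really differs is to compare the
>> query plans produced on the two clusters; `df` below stands for the
>> job's final DataFrame:
>>
>> // Prints the parsed, analyzed, optimized and physical plans; diffing
>> // this output between EMR (2.4.4) and Dataproc (2.4.8) shows whether
>> // the two versions planned the job differently.
>> df.explain(true)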
>>
>> On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee <ranadi...@gmail.com>
>> wrote:
>>
>>> Hi Ori,
>>>
>>> A single task in the final step can result from various scenarios, such
>>> as an aggregate operation that produces only one value (e.g. a global
>>> count) or a key-based aggregate with only one key. There could be other
>>> scenarios as well. However, that would be the case on both EMR and
>>> Dataproc if the same code is run on the same data in both cases.
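>>>
>>> A couple of illustrative sketches (the dataset name is made up):
>>>
>>> import org.apache.spark.sql.functions.count
>>>
>>> // A global aggregate collapses to one row, so the final stage
>>> // runs as a single task:
>>> val total = events.agg(count("*"))
>>>
>>> // An explicit coalesce/repartition to one partition does the same:
>>> val merged = events.coalesce(1)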
>>>
>>> On a separate note, since you have now changed the size and number of
>>> nodes, you may need to re-optimize the number and size of executors for
>>> the job, and perhaps the number of partitions as well, to make optimal
>>> use of the cluster resources.
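>>>
>>> As a rough illustration only (these numbers are assumptions, not
>>> recommendations): on 64-vCPU / 512 GiB workers, a common starting point
>>> is about 5 cores per executor, which gives ~12 executors per node and
>>> roughly 512 / 12 ≈ 40 GiB per executor, split between heap and overhead:
>>>
>>> spark-submit \
>>>   --conf spark.executor.cores=5 \
>>>   --conf spark.executor.memory=34g \
>>>   --conf spark.executor.memoryOverhead=6g \
>>>   ...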
>>>
>>> Regards,
>>> Ranadip
>>>
>>> On Tue, 24 May 2022, 10:45 Ori Popowski, <ori....@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark
>>>> 2.4.8. I am creating a cluster with the exact same configuration; the
>>>> only difference is that the original cluster uses 78 workers with 96
>>>> CPUs and 768 GiB of memory each, while in the new cluster I am using
>>>> 117 machines with 64 CPUs and 512 GiB each, to reach the same total
>>>> amount of resources in the cluster.
>>>>
>>>> The job runs with the same configuration (number of partitions,
>>>> parallelism, etc.) and reads the same data. However, something strange
>>>> happens and the job takes 20 hours. What I observed is that there is a
>>>> stage where the driver instantiates a single task, and this task never
>>>> starts because the shuffle that moves all the data to it takes forever.
>>>>
>>>> I also compared the runtime configuration and found some minor
>>>> differences (due to Dataproc being different from EMR) but I haven't found
>>>> any substantial difference.
>>>>
>>>> In other stages the cluster utilizes all 400 partitions, and it's not
>>>> clear to me why Spark decides to schedule a single task in this one.
>>>>
>>>> Can anyone provide an insight as to why such a thing would happen?
>>>>
>>>> Thanks
>>>>

-- 
"...:::Aniket:::... Quetzalco@tl"
