Hi Ori,

A single task in the final stage can arise in several scenarios: an aggregate operation that produces only one value (e.g. count), or a key-based aggregate where the data contains only one key, among others. However, that would be the case on both EMR and Dataproc if the same code is run on the same data.
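For illustration, a minimal sketch of both scenarios; the input path and column name below are made up, not taken from your job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("single-task-example").getOrCreate()
    val df = spark.read.parquet("gs://some-bucket/input")  // made-up input path

    // Global aggregate: partial counts per partition run in parallel, but the
    // final combine is planned as a single-partition stage, i.e. one task.
    val total = df.count()

    // Key-based aggregate: if the data happens to contain only one distinct
    // key, the shuffle hashes every row to the same reducer, so a single task
    // receives all the data while the other shuffle partitions stay empty.
    val perKey = df.groupBy("user_id").count()
    perKey.show()

In both cases the Spark UI would show one task doing all the work in the last stage, regardless of which platform the job runs on.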
On a separate note, since you have now changed the size and number of nodes, you may need to re-tune the number and size of executors, and perhaps the number of partitions as well, to use the cluster resources optimally; a rough sizing sketch follows below the quoted message.

Regards,
Ranadip

On Tue, 24 May 2022, 10:45 Ori Popowski, <ori....@gmail.com> wrote:

> Hello
>
> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark 2.4.8. I am creating a cluster with the exact same configuration, where the only difference is that the original cluster uses 78 workers with 96 CPUs and 768GiB memory each, and in the new cluster I am using 117 machines with 64 CPUs and 512GiB each, to achieve the same amount of resources in the cluster.
>
> The job is run with the same configuration (number of partitions, parallelism, etc.) and reads the same data. However, something strange happens and the job takes 20 hours. What I observed is that there is a stage where the driver instantiates a single task, and this task never starts because the shuffle of moving all the data to it takes forever.
>
> I also compared the runtime configuration and found some minor differences (due to Dataproc being different from EMR) but I haven't found any substantial difference.
>
> In other stages the cluster utilizes all the partitions (400), and it's not clear to me why it decides to invoke a single task.
>
> Can anyone provide an insight as to why such a thing would happen?
>
> Thanks
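To make the re-tuning point concrete, here is a rough sketch for workers of the new shape (64 vCPUs / 512 GiB); every number is an illustrative assumption, not a tuned recommendation for your workload:

    import org.apache.spark.sql.SparkSession

    // Rough sizing for 117 workers with 64 vCPUs / 512 GiB each:
    //   5 cores per executor  -> ~12 executors per node, leaving a core or
    //                            two for the OS and YARN daemons
    //   12 * 117              -> ~1400 executors in total
    //   ~40 GiB per executor, split between heap and off-heap overhead
    val spark = SparkSession.builder()
      .appName("resized-cluster-job")
      .config("spark.executor.instances", "1400")
      .config("spark.executor.cores", "5")
      .config("spark.executor.memory", "36g")
      .config("spark.executor.memoryOverhead", "4g")
      // 400 shuffle partitions would leave most of the ~7000 executor cores
      // idle; a small multiple of the total core count is a common start.
      .config("spark.sql.shuffle.partitions", "7000")
      .getOrCreate()

The same settings can equally be passed as --conf flags at submit time; the point is just that executor and partition counts tuned for 96-core/768GiB nodes won't carry over unchanged to 64-core/512GiB ones.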