+cloud-dataproc-discuss

On Wed, May 25, 2022 at 12:33 AM Ranadip Chatterjee <ranadi...@gmail.com> wrote:
> To me, it seems like the data being processed on the two systems is not
> identical. I can't think of any other reason why the single-task stage
> would get a different number of input records in the two cases. 700 GB of
> input to a single task is not good, and seems to be the bottleneck.
>
> On Wed, 25 May 2022, 06:32 Ori Popowski, <ori....@gmail.com> wrote:
>
>> Hi,
>>
>> Both jobs use spark.dynamicAllocation.enabled, so there's no need to
>> change the number of executors. There are 702 executors in the Dataproc
>> cluster, so this is not the problem.
>> As for the number of partitions - I didn't change it, and it's still
>> 400. While writing this, I realize that I have more partitions than
>> executors, but the same applies to EMR.
>>
>> I am observing one task in the final stage on EMR as well. The
>> difference is that on EMR that task receives about 50K of data, while on
>> Dataproc it receives 700 GB. I don't understand why this is happening.
>> It could mean the execution graph is different, but the job is exactly
>> the same. Could it be because the minor version of Spark is different?
>>
>> On Wed, May 25, 2022 at 12:27 AM Ranadip Chatterjee
>> <ranadi...@gmail.com> wrote:
>>
>>> Hi Ori,
>>>
>>> A single task for the final step can result from various scenarios,
>>> such as an aggregate operation that produces only one value (e.g.
>>> count), or a key-based aggregate with only one key. There could be
>>> other scenarios as well. However, the same would happen on both EMR
>>> and Dataproc if the same code runs on the same data in both cases.
>>>
>>> On a separate note, since you have now changed the size and number of
>>> nodes, you may need to re-optimize the number and size of executors
>>> for the job, and perhaps the number of partitions as well, to use the
>>> cluster resources optimally.
>>>
>>> Regards,
>>> Ranadip
>>>
>>> On Tue, 24 May 2022, 10:45 Ori Popowski, <ori....@gmail.com> wrote:
>>>
>>>> Hello
>>>>
>>>> I migrated a job from EMR with Spark 2.4.4 to Dataproc with Spark
>>>> 2.4.8. I am creating a cluster with the exact same configuration; the
>>>> only difference is that the original cluster uses 78 workers with 96
>>>> CPUs and 768 GiB of memory each, while in the new cluster I am using
>>>> 117 machines with 64 CPUs and 512 GiB each, to achieve the same total
>>>> resources in the cluster.
>>>>
>>>> The job runs with the same configuration (number of partitions,
>>>> parallelism, etc.) and reads the same data. However, something
>>>> strange happens, and the job takes 20 hours. What I observed is that
>>>> there is a stage where the driver instantiates a single task, and
>>>> this task never starts because shuffling all the data to it takes
>>>> forever.
>>>>
>>>> I also compared the runtime configurations and found some minor
>>>> differences (due to Dataproc being different from EMR), but I haven't
>>>> found any substantial difference.
>>>>
>>>> In other stages the cluster utilizes all the partitions (400), and
>>>> it's not clear to me why it decides to invoke a single task here.
>>>>
>>>> Can anyone provide insight as to why such a thing would happen?
>>>>
>>>> Thanks
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

-- 
"...:::Aniket:::... Quetzalco@tl"
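
A minimal, self-contained sketch of the two single-task scenarios Ranadip
describes. The data, app name, and local master are hypothetical and purely
for illustration, since the real job's code is not shown in the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.sum

    object SingleTaskScenarios {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("single-task-scenarios")              // hypothetical name
          .master("local[4]")                            // local run, illustration only
          .config("spark.sql.shuffle.partitions", "400") // same value as the job above
          .getOrCreate()

        // Ten million rows that all share a single grouping key.
        val df = spark.range(10000000L).selectExpr("'only-key' AS k", "id AS v")

        // Scenario 1: a global aggregate. Partial sums from every input
        // partition are shuffled to a single reducer, so the final stage has
        // exactly one task. This is cheap, because only one partial row per
        // partition crosses the shuffle.
        df.agg(sum("v")).show()

        // Scenario 2: a key-based aggregate where every row has the same key.
        // Spark still schedules 400 reduce tasks, but every row hashes to the
        // same shuffle partition, so one task receives all the data while the
        // other 399 stay empty -- the expensive case, and the shape that
        // matches "700 GB of input to a single task".
        df.groupBy("k").agg(sum("v")).show()

        spark.stop()
      }
    }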
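
To test the hypothesis that the inputs differ, and to see whether a single
key is absorbing the 700 GB, one option is a per-key distribution check run
against the same input on both clusters. The input path and key column below
are hypothetical placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{count, desc}

    object InputParityCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("input-parity-check").getOrCreate()

        // Hypothetical input path and key column; substitute the real ones
        // (s3:// on EMR, gs:// on Dataproc).
        val input = spark.read.parquet("s3://some-bucket/some-path")

        // If the data is identical, the total count should match across clusters.
        println(s"total rows = ${input.count()}")

        // Rows per grouping key. If one key dominates on Dataproc but not on
        // EMR, that explains why its reduce task receives almost all the data.
        input.groupBy("some_key")
          .agg(count("*").as("rows"))
          .orderBy(desc("rows"))
          .show(20, truncate = false)

        spark.stop()
      }
    }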
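
As for re-optimizing executors and partitions after changing the node shape:
with dynamic allocation on, per-executor sizing and shuffle parallelism are
the knobs left to tune. The keys below are standard Spark 2.4 configuration
settings; the values are either the ones mentioned in the thread or
illustrative guesses, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("migrated-job")                           // hypothetical name
      .config("spark.dynamicAllocation.enabled", "true") // as used by both jobs
      .config("spark.sql.shuffle.partitions", "400")     // the job's current value
      .config("spark.default.parallelism", "400")        // RDD-side equivalent
      .config("spark.executor.cores", "4")               // illustrative: re-fit to 64-CPU nodes
      .config("spark.executor.memory", "28g")            // illustrative: re-fit to 512 GiB nodes
      .getOrCreate()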