Hi,
I am running a job on Spark (on AWS EMR), and some stages are taking a lot
longer with Spark 2.4 than with Spark 2.3.1:

Spark 2.4:
[image: image.png]

Spark 2.3.1:
[image: image.png]

With Spark 2.4, the keyBy operation takes more than 10x as long as it took
with Spark 2.3.1.
It seems to be related to the number of tasks / partitions.
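To confirm it is really the partition count that differs between the two runs, I could compare the parent RDD's partition count with the keyed RDD's. A minimal sketch in Scala (the names `records` and `extractKey` are placeholders, not from my actual job):

    // `records` and `extractKey` are placeholder names for illustration only.
    val keyed = records.keyBy(extractKey)

    // keyBy is a narrow transformation, so it should preserve the parent's
    // partition count; these two numbers are expected to match.
    println(s"parent partitions: ${records.getNumPartitions}")
    println(s"keyed partitions:  ${keyed.getNumPartitions}")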

Questions:
- Isn't the number of tasks in a stage supposed to be determined by the
number of partitions of the RDD produced by the previous stage? Did that
change in version 2.4?
- Which tools or configuration settings can I try in order to diagnose and
fix this performance regression? (One thing I could try is sketched below.)
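For example, I could pin the parallelism explicitly instead of relying on the defaults, so both Spark versions run the stage with the same task count. A sketch under those assumptions (the input path, key function, and the value 200 are arbitrary examples, not from my job):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.SparkSession

    // Sketch only: fix spark.default.parallelism so shuffles use a known
    // partition count regardless of the Spark version's defaults.
    val spark = SparkSession.builder()
      .appName("pin-parallelism-example")
      .config("spark.default.parallelism", "200")
      .getOrCreate()
    val sc = spark.sparkContext

    val records = sc.textFile("s3://some-bucket/some-path") // placeholder input
    val keyed = records.keyBy(line => line.take(8))         // placeholder key function

    // Alternatively, force a known partition count right before the expensive
    // stage with an explicit partitioner on the keyed RDD.
    val repartitioned = keyed.partitionBy(new HashPartitioner(200))
    println(repartitioned.getNumPartitions)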

Thanks.
Pedro.
