Hi,

Can you show the plans with explain(extended=true) for both versions? That's where I'd start to pinpoint the issue. Perhaps the underlying execution engine changed in a way that affects keyBy? Dunno, just guessing...
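For reference, this is all I mean (a minimal sketch; `ds`, `rdd`, and `keyOf` are placeholders for whatever Dataset and RDD your job actually builds):

  // Dataset/DataFrame side: print the logical + physical plans.
  ds.explain(extended = true)

  // keyBy lives in the RDD API, where there is no plan to explain, but
  // toDebugString prints the lineage with the partition count of each step.
  println(rdd.keyBy(keyOf).toDebugString)

Running the same snippet on 2.3.1 and 2.4 and diffing the output should show exactly where the partition count drops to 128.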
Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero <tuerope...@gmail.com> wrote:

> I did a repartition to 10000 (hardcoded) before the keyBy and it now ends
> in 1.2 minutes.
> The questions remain open, though, because I don't want to hardcode the
> parallelism.
>
> On Fri, Feb 8, 2019 at 12:50 PM Pedro Tuero (tuerope...@gmail.com) wrote:
>
>> 128 is the default parallelism defined for the cluster.
>> The question now is why the keyBy operation uses the default parallelism
>> instead of the number of partitions of the RDD created by the previous
>> step (5580).
>> Any clues?
>>
>> On Thu, Feb 7, 2019 at 3:30 PM Pedro Tuero (tuerope...@gmail.com) wrote:
>>
>>> Hi,
>>> I am running a job on Spark (using AWS EMR) and some stages are taking
>>> a lot longer with Spark 2.4 than with Spark 2.3.1:
>>>
>>> Spark 2.4:
>>> [image: Spark UI stage timings]
>>>
>>> Spark 2.3.1:
>>> [image: Spark UI stage timings]
>>>
>>> With Spark 2.4, the keyBy operation takes more than 10x as long as it
>>> did with Spark 2.3.1.
>>> It seems to be related to the number of tasks/partitions.
>>>
>>> Questions:
>>> - Isn't the number of tasks of a job supposed to match the number of
>>> partitions of the RDD left by the previous job? Did that change in
>>> version 2.4?
>>> - Which tools/configuration could I try to fix this severe performance
>>> regression?
>>>
>>> Thanks,
>>> Pedro.
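P.S. Until the root cause is clear, you could avoid the hardcoded 10000 by reusing the upstream partition count and passing it to the shuffle explicitly. A minimal sketch (RDD API; the input path, the key function, and the groupByKey at the end are placeholders, not what your job necessarily does):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("keyBy-partitions").getOrCreate()
  // Hypothetical input; stands in for the RDD produced by your previous step.
  val prev = spark.sparkContext.textFile("s3://bucket/input")

  val n = prev.getNumPartitions                 // 5580 in your run, no hardcoding
  val keyed = prev.keyBy(line => line.take(8))  // keyBy itself is a narrow map, no shuffle
  val grouped = keyed.groupByKey(n)             // the shuffle: pass numPartitions explicitly

Setting spark.default.parallelism for the cluster works too, but passing numPartitions keeps the fix scoped to this one job.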