And this is happening in every job I run. It is not just one case. If I add a forced repartition it works fine, even better than before. But I run the same code for different inputs, so the number of partitions to repartition to must be derived from the input rather than hardcoded.
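Roughly what I mean is something like the sketch below: read the partition count from the input RDD and use it for the forced repartition, instead of hardcoding 10000. The path, types and keying function here are only placeholders, not the real job:

    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public final class InputSizedRepartition {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("input-sized-repartition");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // args[0] stands in for the S3 path of the sequence files left by
                // the previous job (5580 parts in the example from this thread).
                JavaPairRDD<Text, Text> input =
                    sc.sequenceFile(args[0], Text.class, Text.class);

                // Take the shuffle width from the input instead of hardcoding it,
                // so the same code adapts to inputs of different sizes.
                int parts = input.getNumPartitions();

                JavaPairRDD<String, Tuple2<Text, Text>> keyed =
                    input.repartition(parts)   // forced repartition, as in the workaround
                         .keyBy(pair -> pair._1().toString());

                System.out.println("keyBy ran over " + keyed.getNumPartitions() + " partitions");
            }
        }
    }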
On Tue, Feb 12, 2019 at 11:22, Pedro Tuero (tuerope...@gmail.com) wrote:

> Hi Jacek.
> I'm not using Spark SQL, I'm using the RDD API directly.
> I can confirm that the jobs and stages are the same on both executions.
> In the Environment tab of the web UI, when using Spark 2.4,
> spark.default.parallelism=128 is shown, while in 2.3.1 it is not.
> But in 2.3.1 it should be the same, because 128 is the number of cores of
> the cluster * 2, and that didn't change in the latest version.
>
> In the example I gave, 5580 is the number of parts left by a previous job
> in S3, in Hadoop sequence files. So the initial RDD has 5580 partitions.
> While in 2.3.1, RDDs created by transformations of the
> initial RDD keep the same number of partitions, in 2.4 the number of
> partitions resets to the default.
> So RDD1, the product of the first mapToPair, prints 5580 when
> getPartitions() is called in 2.3.1, while it prints 128 in 2.4.
>
> Regards,
> Pedro
>
>
> On Tue, Feb 12, 2019 at 09:13, Jacek Laskowski (ja...@japila.pl) wrote:
>
>> Hi,
>>
>> Can you show the plans with explain(extended=true) for both versions?
>> That's where I'd start to pinpoint the issue. Perhaps the underlying
>> execution engine changed in a way that affects keyBy? Dunno, just guessing...
>>
>> Regards,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero <tuerope...@gmail.com> wrote:
>>
>>> I did a repartition to 10000 (hardcoded) before the keyBy and it finishes
>>> in 1.2 minutes.
>>> The questions remain open, because I don't want to hardcode parallelism.
>>>
>>> On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (tuerope...@gmail.com) wrote:
>>>
>>>> 128 is the default parallelism defined for the cluster.
>>>> The question now is why the keyBy operation uses the default parallelism
>>>> instead of the number of partitions of the RDD created by the previous
>>>> step (5580).
>>>> Any clues?
>>>>
>>>> On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (tuerope...@gmail.com) wrote:
>>>>
>>>>> Hi,
>>>>> I am running a job in Spark (using AWS EMR) and some stages are taking
>>>>> a lot longer with Spark 2.4 than with Spark 2.3.1:
>>>>>
>>>>> Spark 2.4:
>>>>> [image: image.png]
>>>>>
>>>>> Spark 2.3.1:
>>>>> [image: image.png]
>>>>>
>>>>> With Spark 2.4, the keyBy operation takes more than 10X longer than
>>>>> with Spark 2.3.1.
>>>>> It seems to be related to the number of tasks / partitions.
>>>>>
>>>>> Questions:
>>>>> - Isn't the number of tasks of a job supposed to be related to the
>>>>> number of parts of the RDD left by the previous job? Did that change in
>>>>> version 2.4?
>>>>> - Which tools/configuration should I try to fix this severe
>>>>> performance regression?
>>>>>
>>>>> Thanks.
>>>>> Pedro.
>>>>>
>>>>
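For reference, this is roughly how I'm tracing where the partition count drops between 2.3.1 and 2.4: print getNumPartitions() after each transformation and compare it with the cluster default. Again just a sketch with placeholder path, key/value types and key function:

    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public final class PartitionTrace {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("partition-trace");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // args[0] stands in for the S3 prefix of the input sequence files.
                JavaPairRDD<Text, Text> initial =
                    sc.sequenceFile(args[0], Text.class, Text.class);
                System.out.println("initial: " + initial.getNumPartitions());

                // Narrow transformation: expected to preserve the upstream count
                // (5580), not reset to spark.default.parallelism (128).
                JavaPairRDD<String, Long> mapped =
                    initial.mapToPair(p -> new Tuple2<>(p._1().toString(), 1L));
                System.out.println("after mapToPair: " + mapped.getNumPartitions());

                JavaPairRDD<String, Tuple2<String, Long>> keyed =
                    mapped.keyBy(t -> t._1());
                System.out.println("after keyBy: " + keyed.getNumPartitions());

                // Also log what the default actually is on this cluster.
                System.out.println("spark.default.parallelism = " + sc.defaultParallelism());
            }
        }
    }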