Hi,

Can you show the plans with explain(extended=true) for both versions?
That's where I'd start to pinpoint the issue. Perhaps an underlying
execution engine change affected keyBy? Just guessing, though...
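
A minimal sketch of that in spark-shell (the query, key function and numbers
below are placeholders, not the actual job):

  // Dataset side: prints the parsed, analyzed, optimized and physical plans
  val df = spark.range(1000000).selectExpr("id % 100 AS k", "id AS v")
  df.groupBy("k").count().explain(extended = true)

  // RDD side (keyBy is an RDD transformation): toDebugString shows the
  // lineage and the partition count of each RDD in it
  val keyed = sc.parallelize(1 to 1000000, 5580).keyBy(_ % 100)
  println(keyed.toDebugString)

Diffing that output between 2.3.1 and 2.4 should show where the partition
count (and hence the task count) changes.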

Regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski


On Fri, Feb 8, 2019 at 5:09 PM Pedro Tuero <tuerope...@gmail.com> wrote:

> I did a repartition to 10000 (hardcoded) before the keyBy and it finishes in
> 1.2 minutes.
> The questions remain open, because I don't want to hardcode the parallelism.
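> A minimal sketch of that workaround without the hardcoded value, using
> placeholder names in spark-shell:
>
>   // stand-in for the RDD produced by the previous step
>   val previousStep = sc.parallelize(1 to 1000000, 5580)
>   // same repartition-before-keyBy workaround, but deriving the count
>   // from the upstream RDD instead of hardcoding 10000
>   val keyed = previousStep
>     .repartition(previousStep.getNumPartitions)
>     .keyBy(_ % 1000)
>   println(keyed.getNumPartitions)  // 5580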
>
> On Fri, Feb 8, 2019 at 12:50, Pedro Tuero (
> tuerope...@gmail.com) wrote:
>
>> 128 is the default parallelism defined for the cluster.
>> The question now is why the keyBy operation uses the default parallelism
>> instead of the number of partitions of the RDD created by the previous
>> step (5580).
>> Any clues?
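>> A quick way to check where the count drops, sketched with placeholder
>> names in spark-shell:
>>
>>   // stand-in for the RDD produced by the previous step
>>   val previousStep = sc.parallelize(1 to 1000000, 5580)
>>   println(sc.defaultParallelism)            // the cluster default (128 here)
>>   println(previousStep.getNumPartitions)    // expect 5580
>>   // keyBy is a narrow map, so it should keep the upstream partitioning
>>   println(previousStep.keyBy(_ % 1000).getNumPartitions)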
>>
>> On Thu, Feb 7, 2019 at 15:30, Pedro Tuero (
>> tuerope...@gmail.com) wrote:
>>
>>> Hi,
>>> I am running a job in Spark (on AWS EMR) and some stages are taking a
>>> lot longer with Spark 2.4 than with Spark 2.3.1:
>>>
>>> Spark 2.4:
>>> [image: image.png]
>>>
>>> Spark 2.3.1:
>>> [image: image.png]
>>>
>>> With Spark 2.4, the keyBy operation takes more than 10x as long as it did
>>> with Spark 2.3.1.
>>> It seems to be related to the number of tasks / partitions.
>>>
>>> Questions:
>>> - Isn't the number of tasks in a job supposed to be related to the number
>>> of partitions of the RDD left by the previous job? Did that change in
>>> version 2.4?
>>> - Which tools / configuration could I try to fix this severe performance
>>> degradation (see the sketch below)?
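>>>
>>> For example, something along these lines (placeholder names and values;
>>> I am not sure this is the right knob):
>>>
>>>   import org.apache.spark.sql.SparkSession
>>>
>>>   val spark = SparkSession.builder()
>>>     .appName("keyBy-regression-check")  // placeholder app name
>>>     // pin the parallelism explicitly instead of the cluster default
>>>     .config("spark.default.parallelism", "5580")
>>>     .getOrCreate()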
>>>
>>> Thanks.
>>> Pedro.
>>>
>>
