Right, you've already established a few times that the difference is the
number of partitions. Russell gave what is almost surely the correct
answer: it's AQE (Adaptive Query Execution). On toy cases it isn't always
a win; disable it if you need to. It's not a problem per se in 3.1, as AQE
generally speeds up more realistic workloads.
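For anyone who wants to test this themselves, here is a minimal sketch of
turning AQE off in PySpark. The config key spark.sql.adaptive.enabled is
Spark's actual setting; the session setup around it is just illustrative:

    from pyspark.sql import SparkSession

    # Build a session with Adaptive Query Execution disabled up front.
    # (The app name is a placeholder; the AQE config key is the point.)
    spark = (
        SparkSession.builder
        .appName("aqe-off-test")
        .config("spark.sql.adaptive.enabled", "false")
        .getOrCreate()
    )

    # It can also be flipped on an existing session before re-running
    # the job under test:
    spark.conf.set("spark.sql.adaptive.enabled", "false")

Re-running the same action under both settings and comparing the stage
times in the Spark UI should show whether AQE accounts for the gap.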

On Thu, Apr 8, 2021 at 8:52 AM maziyar <maziyar.pan...@iscpif.fr> wrote:

> So this is what I have in my Spark UI for 3.0.2 and 3.1.1:
>
> For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"):
> finished in 10 seconds.
> For pyspark==3.1.1 (same stage "showString at NativeMethodAccessorImpl.java:0"):
> finished in 39 seconds.
>
> As you can see, everything is literally the same between 3.0.2 and 3.1.1:
> number of stages, number of tasks, Input, Output, Shuffle Read, Shuffle
> Write. The difference is that 3.0.2 runs all 12 tasks together, while 3.1.1
> finishes 10/12 first and the other 2 do the actual processing, which I
> shared previously. [screenshots: 3.1.1, 3.0.2]
>
> PS: I have just made the same test in Databricks with 1 worker, on runtime
> 8.1 (includes Apache Spark 3.1.1, Scala 2.12) versus runtime 7.6 (includes
> Apache Spark 3.0.1, Scala 2.12). There is still a difference of over 20
> seconds, which is a big bump when the whole process finishes within a
> minute. Not sure what it is, but until further notice I will advise our
> users not to use Spark/PySpark 3.1.1 locally or in Databricks. (With other
> optimizations maybe it's not noticeable, but this is such simple code, and
> it can become a bottleneck quickly in larger pipelines.)
