Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-18 Thread Maziyar Panahi
2:11 PM > To: Maziyar Panahi > Cc: User > Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x > > > Hi, > > Regarding your point: > > I won't be able to defend this request by telling Spark users the > previous major release was and still is

RE: Why is Spark 3.0.x faster than Spark 3.1.x

2021-05-17 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Maziyar, Mich, Do we have any ticket to track this? Any idea if this is going to be fixed in 3.1.2? Thanks and Regards, Abhishek From: Mich Talebzadeh Sent: Friday, April 9, 2021 2:11 PM To: Maziyar Panahi Cc: User Subject: Re: Why is Spark 3.0.x faster than Spark 3.1.x Hi, Regarding

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-09 Thread Mich Talebzadeh
Hi, Regarding your point: I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest major release ... With the benefit of hindsight version 3.1.1 was released recently and the definition of stable (from a practical

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-09 Thread Maziyar Panahi
Thanks Mich, I will ask all of our users to use pyspark 3.0.x and will change all the notebooks/scripts to switch back from 3.1.1 to 3.0.2. That being said, I won't be able to defend this request by telling Spark users the previous major release was and still is more stable than the latest m

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
Well the normal course of action (considering laws of diminishing returns) is that your mileage varies: Spark 3.0.1 is pretty stable and good enough. Unless there is an overriding reason why you have to use 3.1.1, you can set it aside and try it when you have other use cases. For now I guess you c

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Maziyar Panahi
I personally added the following to my SparkSession in 3.1.1, and the result was exactly the same as before (local master). 3.1.1 is still 4-5 times slower than 3.0.2, at least for that piece of code. I will do more investigation to see how it does with other stuff, especially anything withou

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
In Spark 3.1.1 I enabled the parameter spark_session.conf.set("spark.sql.adaptive.enabled", "true") to see its effects in yarn cluster mode, i.e. spark-submit --master yarn --deploy-mode client. With 4 executors it crashed the cluster. I then reduced the number of executors to 2 and this time it ra

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Maziyar Panahi
Thanks Sean, I have already tried adding that and the result is absolutely the same. That config cannot be the reason (at least not alone) because my comparison is between Spark 3.0.2 and Spark 3.1.1. This config has been set to true since the beginning of 3.0.0 and hasn't changed:

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Sean Owen
Right, you already established a few times that the difference is the number of partitions. Russell answered with what is almost surely the correct answer, that it's AQE. In toy cases it isn't always a win. Disable it if you need to. It's not a problem per se in 3.1; AQE speeds up more realistic wo

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
So this is what I have in my Spark UI for 3.0.2 and 3.1.1: For pyspark==3.0.2 (stage "showString at NativeMethodAccessorImpl.java:0"): Finished in 10 seconds. For pyspark==3.1.1 (same stage "showString at

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
Hi Mich, Thanks for the reply. I have tried to minimize as much as possible the effect of other factors between pyspark==3.0.2 and pyspark==3.1.1, including not reading csv or gz and just reading the Parquet. Here is some code purely in pyspark (nothing else included) and it finishes within 47 second

Re: Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread Mich Talebzadeh
Hi, I just looked through your code 1. I assume you are testing this against spark 3.1.1? 2. You are testing this set-up in a local mode in a single JVM, so it is not really distributed. I doubt whether a meaningful performance deduction can be made here 3. It is accepted that new

Why is Spark 3.0.x faster than Spark 3.1.x

2021-04-08 Thread maziyar
Hi, I have some simple code that does a groupby, agg count, sort, etc. This code finishes within 5 minutes on Spark 3.1.x. However, the same code, same dataset, same SparkSession (configs) on Spark 3.0.2 will finish within a minute. That is over a 5x difference. My SparkSession (same when it