Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi Xiao, that is the right attitude, thanks a ton :) Hi Kalin, https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes The latest EMR version should be available right out of the box; perhaps you can raise a quick AWS ticket and find out in case its release it

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, @Enrico, I've added just the SQL query pages (+js dependencies etc.) to the Google Drive folder - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing That is what you had in mind, right? They are different indeed. (For some reason after I saved them off of the

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I would not expect your queries to hit such a major performance regression between releases. Also, please try the 3.0 preview releases. Thanks, Xiao Kalin Stoyanov wrote on Wed, Jan 15, 2020 at 10:53 AM: > Hi Xiao, > >

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi, I am pretty sure that AWS released 5.28.1 with some bug fixes the day before yesterday. Also, please ensure that you are using s3:// instead of s3a:// or anything like that. On another note, Xiao is not entirely right in saying that issues in EMR should not be posted here; a large group of

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi Xiao, Thanks, I didn't know that. This https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/ implies that their fork is not used in EMR 5.27. I tried that and it has the same issue. But then again, in their article they were comparing EMR 5.27 vs 5.16 so I

Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks Xiao, a more up-to-date publication in a conference like VLDB will certainly turn the tide for many of us trying to defend Spark's Optimizer. On Wed, Jan 15, 2020 at 9:39 AM Xiao Li wrote: > In the upcoming Spark 3.0, we introduced a new framework for Adaptive > Query Execution in

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
EMR has its own fork of Spark, called EMR runtime. It is not Apache Spark. You might need to talk with them instead of posting questions in the Apache Spark community. Cheers, Xiao Kalin Stoyanov wrote on Wed, Jan 15, 2020 at 9:53 AM: > Hi all, > > First of all let me say that I am pretty new to

Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, First of all let me say that I am pretty new to Spark so this could be entirely my fault somehow... I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4, and it finished more slowly than when I had run it locally (on Spark 2.4.1). I checked out the event logs, and

Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Xiao Li
In the upcoming Spark 3.0, we introduced a new framework for Adaptive Query Execution in Catalyst. This can adjust the plans based on the runtime statistics. This is missing in Calcite based on my understanding. Catalyst is also very easy to enhance. We also use the dynamic programming approach

Re: Why Apache Spark doesn't use Calcite?

2020-01-15 Thread Debajyoti Roy
Thanks all, and Matei. TL;DR of the conclusion for my particular case: Qualitatively, while Catalyst[1] tries to mitigate the learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite[2] and risks falling into local minima. Quantitatively, there is no