Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi Xiao, that is the right attitude, thanks a ton :) Hi Kalin, https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes The latest EMR version should be available right out of the box; perhaps you can raise a quick AWS ticket and find out in case its release it

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, @Enrico, I've added just the SQL query pages (+ JS dependencies etc.) to the Google Drive folder - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing That is what you had in mind, right? They are indeed different. (For some reason after I saved them off of the

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I would not expect your queries to hit such a major performance regression between releases. Also, please try the 3.0 preview releases. Thanks, Xiao On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov wrote: > Hi Xiao, > >

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi, I am pretty sure that AWS released 5.28.1 with some bug fixes the day before yesterday. Also, please ensure that you are using s3:// instead of s3a:// or anything like that. On another note, Xiao is not entirely right in saying that EMR issues should not be posted here, a large group of

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi Xiao, Thanks, I didn't know that. This https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/ implies that their fork is not used in EMR 5.27. I tried that and it has the same issue. But then again, in their article they were comparing EMR 5.27 vs 5.16 so I

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
EMR has its own fork of Spark, called EMR runtime. It is not Apache Spark. You might need to talk with them instead of posting questions in the Apache Spark community. Cheers, Xiao On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov wrote: > Hi all, > > First of all let me say that I am pretty new to

Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, First of all let me say that I am pretty new to Spark so this could be entirely my fault somehow... I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4, and it finished more slowly than when I had run it locally (on Spark 2.4.1). I checked out the event logs, and