If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I would not expect your queries to hit such a major performance regression in any release. Also, please try the 3.0 preview releases.

Thanks,
Xiao
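One minimal way to confirm this, assuming the extra stages come from a changed physical plan, is to run the same transformation under each stock Spark build and diff the plans. A sketch (the workload below is a placeholder, not the job from this thread):

    from pyspark.sql import SparkSession

    # Run this unchanged under each Spark version (e.g. stock 2.4.1, 2.4.4,
    # and a 3.0 preview) and compare the printed plans; an extra Exchange
    # (shuffle) in the plan means an extra stage boundary at runtime.
    spark = SparkSession.builder.appName("regression-check").getOrCreate()
    print("Spark version:", spark.version)

    # Placeholder workload; substitute the real job's transformations.
    df = spark.range(1000000).selectExpr("id % 100 AS key", "id AS value")
    df.groupBy("key").count().explain(extended=True)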
Kalin Stoyanov <kgs.v...@gmail.com> wrote on Wed, Jan 15, 2020 at 10:53 AM:

> Hi Xiao,
>
> Thanks, I didn't know that. This
> https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/
> implies that their fork is not used in EMR 5.27. I tried that and it has
> the same issue. But then again, in their article they were comparing EMR
> 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest
> version of Spark locally and making the comparison that way.
>
> Regards,
> Kalin
>
> On Wed, Jan 15, 2020 at 7:58 PM Xiao Li <gatorsm...@gmail.com> wrote:
>
>> EMR has its own fork of Spark, called the EMR runtime. It is not
>> Apache Spark. You might need to talk with them instead of posting
>> questions in the Apache Spark community.
>>
>> Cheers,
>>
>> Xiao
>>
>> Kalin Stoyanov <kgs.v...@gmail.com> wrote on Wed, Jan 15, 2020 at 9:53 AM:
>>
>>> Hi all,
>>>
>>> First of all, let me say that I am pretty new to Spark, so this could
>>> be entirely my fault somehow...
>>> I noticed this when I was running a job on an Amazon EMR cluster with
>>> Spark 2.4.4, and it finished more slowly than when I had run it locally
>>> (on Spark 2.4.1). I checked the event logs, and the one from the newer
>>> version had more stages.
>>> Then I decided to do a comparison in the same environment, so I created
>>> two versions of the same cluster with the only difference being the EMR
>>> release, and hence the Spark version(?) - the first was emr-5.24.1 with
>>> Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough,
>>> the same thing happened, with the newer version having more stages and
>>> taking almost twice as long to finish.
>>> So I am pretty much at a loss here - could it be that this is caused
>>> not by Spark itself, but by some difference introduced in the EMR
>>> releases? At the moment I can't think of any alternative besides it
>>> being a bug...
>>>
>>> Here are the two event logs:
>>> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
>>> and my code is here:
>>> https://github.com/kgskgs/stars-spark3d
>>>
>>> I ran it like so on the clusters (after putting it on S3):
>>> spark-submit --deploy-mode cluster --py-files
>>> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
>>> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
>>> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>>>
>>> So yeah, I was considering submitting a bug report, but the guide said
>>> it's better to ask here first - so, any ideas on what's going on? Maybe
>>> I am missing something?
>>>
>>> Regards,
>>> Kalin
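The stage counts discussed above can be read straight from the linked event logs: a Spark event log is newline-delimited JSON, one listener event per line. A minimal sketch for tallying them (the file names are placeholders for the two downloaded logs):

    import json
    from collections import Counter

    def event_counts(path):
        # Tally "Event" types in a Spark event log (one JSON object per line).
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[json.loads(line).get("Event", "unknown")] += 1
        return counts

    # Placeholder file names for the two downloaded logs.
    for log in ("eventlog-emr-5.24.1", "eventlog-emr-5.28.0"):
        c = event_counts(log)
        print(log, "jobs:", c["SparkListenerJobStart"],
              "stages:", c["SparkListenerStageCompleted"])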