Hi,

I am pretty sure that AWS released EMR 5.28.1 with some bug fixes the day before yesterday.
Also, please make sure that you are using s3:// instead of s3a:// or anything like that.

On another note, Xiao is not entirely right in saying that issues with EMR should not be posted here. A large group of users run Spark on Databricks, GCP, Azure, native installations, and of course on EMR and Glue. I have always found that the Apache Spark community takes care of each other and answers questions for the largest user base, just like I did now. I think that only Matei Zaharia can make such a sweeping call on what this entire community is about.

Thanks and Regards,
Gourav Sengupta

On Wed, Jan 15, 2020 at 5:53 PM Kalin Stoyanov <kgs.v...@gmail.com> wrote:

> Hi all,
>
> First of all, let me say that I am pretty new to Spark, so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it finished more slowly than when I ran it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> Spark itself, but because of some difference introduced in the EMR
> releases? At the moment I can't think of any other alternative besides it
> being a bug...
>
> Here are the two event logs:
>
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah, I was considering submitting a bug report, but the guide
> said it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
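
For anyone who wants to compare the two event logs mentioned above without opening the History Server, here is a minimal sketch that counts completed stages and sums their durations. It assumes the logs are uncompressed, with one JSON event per line, and that stage completions appear as "SparkListenerStageCompleted" events carrying "Submission Time" and "Completion Time" (in milliseconds) under "Stage Info". The function names and the file paths in the usage note are hypothetical.

```python
import json


def count_stages(lines):
    """Return (number of completed stages, summed stage duration in ms).

    Assumes each non-empty line is one JSON-encoded Spark listener event.
    """
    stages = 0
    total_ms = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only stage-completion events are counted; everything else is skipped.
        if event.get("Event") != "SparkListenerStageCompleted":
            continue
        stages += 1
        info = event.get("Stage Info", {})
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        # Stages that lack either timestamp are counted but add no duration.
        if start is not None and end is not None:
            total_ms += end - start
    return stages, total_ms


def summarize(path):
    """Open an event log file (path is supplied by the caller) and summarize it."""
    with open(path, encoding="utf-8") as f:
        return count_stages(f)
```

Calling summarize("eventlog_spark242") and summarize("eventlog_spark244") on local copies of the two logs (both file names are placeholders) gives a quick side-by-side of stage counts. Note that the summed duration is not wall-clock job time, because stages can run concurrently.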