EMR has its own fork of Spark, called the EMR runtime. It is not
Apache Spark. You might need to talk to AWS instead of posting questions
in the Apache Spark community.
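
If you want to confirm exactly which build a given cluster is running, you
can print the version from a PySpark session. A minimal sketch, assuming
PySpark is available on the driver:

    from pyspark.sql import SparkSession

    # Reuse (or create) the session and print the running Spark version
    spark = SparkSession.builder.getOrCreate()
    print(spark.version)  # e.g. "2.4.4"

    # List any build/runtime-related settings the distribution exposes
    for key, value in spark.sparkContext.getConf().getAll():
        if "version" in key.lower() or "emr" in key.lower():
            print(key, "=", value)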

Cheers,

Xiao

On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov <kgs.v...@gmail.com> wrote:

> Hi all,
>
> First of all, let me say that I am pretty new to Spark, so this could be
> entirely my fault somehow...
> I noticed this when I was running a job on an Amazon EMR cluster with
> Spark 2.4.4, and it finished slower than when I had run it locally (on
> Spark 2.4.1). I checked the event logs, and the one from the newer
> version had more stages.
> Then I decided to do a comparison in the same environment, so I created
> two versions of the same cluster with the only difference being the EMR
> release, and hence the Spark version(?) - the first one was emr-5.24.1 with
> Spark 2.4.2, and the second one emr-5.28.0 with Spark 2.4.4. Sure enough,
> the same thing happened, with the newer version having more stages and
> taking almost twice as long to finish.
> So I am pretty much at a loss here - could it be that it is not because of
> Spark itself, but because of some difference introduced in the EMR
> releases? At the moment I can't think of any other alternative besides it
> being a bug...
>
> Here are the two event logs:
>
> https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
> and my code is here:
> https://github.com/kgskgs/stars-spark3d
>
> I ran it like so on the clusters (after putting it on s3):
> spark-submit --deploy-mode cluster --py-files
> s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py
> --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100
> --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/
>
> So yeah, I was considering submitting a bug report, but the guide said
> it's better to ask here first, so any ideas on what's going on? Maybe
> I am missing something?
>
> Regards,
> Kalin
>
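
A quick way to quantify the difference between the two event logs linked
above: Spark event logs are newline-delimited JSON, one record per listener
event, so the stages in each run can be tallied directly. A rough sketch;
the file names are placeholders for the downloaded logs:

    import json
    from collections import Counter

    def count_events(path):
        """Tally event types in a Spark event log (one JSON object per line)."""
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[json.loads(line)["Event"]] += 1
        return counts

    # Placeholder file names for the two downloaded event logs
    for log in ["eventlog_spark_2.4.2", "eventlog_spark_2.4.4"]:
        counts = count_events(log)
        print(log, "completed stages:", counts["SparkListenerStageCompleted"])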
