Adrian Pusty created SPARK-54037:
------------------------------------
Summary: Throughput deteriorated after migration from Spark 3.5.5
to Spark 4.0.0
Key: SPARK-54037
URL: https://issues.apache.org/jira/browse/SPARK-54037
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.0.0
Reporter: Adrian Pusty
My team recently updated our Spark dependency from 3.5.5 to 4.0.0.
This included switching to spark-4.0.0-bin-hadoop3.tgz, updating our pom.xml
files, and changing import statements (org.apache.spark.sql ->
org.apache.spark.sql.classic).
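For reference, the import change looked roughly like the sketch below. This is a
minimal illustration and not our actual code: the object name ThroughputJob and
the app name are made up, and I am assuming the builder is called on the classic
companion object; only the package rename reflects the real change we made.
{code:scala}
// Before (Spark 3.5.5):
// import org.apache.spark.sql.SparkSession

// After (Spark 4.0.0): the concrete "classic" implementation now lives
// under org.apache.spark.sql.classic.
import org.apache.spark.sql.classic.SparkSession

object ThroughputJob {
  def main(args: Array[String]): Unit = {
    // Builder usage is unchanged apart from the import above.
    val spark = SparkSession.builder()
      .appName("throughput-job")
      .getOrCreate()

    // ... job body unchanged ...

    spark.stop()
  }
}
{code}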
After this change our throughput (calculated as rows transferred per second)
has dropped significantly for both of our scenarios: 1. read from file, write to
database, and 2. read from database, write to database.
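Both scenarios boil down to something like the sketch below (a minimal,
spark-shell-style illustration; the JDBC URL, credentials, table names, input
path and file format are hypothetical placeholders, not our real configuration):
{code:scala}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.classic.SparkSession

val spark = SparkSession.builder().appName("throughput-repro").getOrCreate()

// Hypothetical connection details, placeholders only.
val jdbcUrl = "jdbc:postgresql://db-host:5432/target"
val props = new Properties()
props.setProperty("user", "...")
props.setProperty("password", "...")

// Scenario 1: read from file, write to database.
val fromFile = spark.read.parquet("/data/input")
fromFile.write.mode(SaveMode.Append).jdbc(jdbcUrl, "target_table", props)

// Scenario 2: read from database, write to database.
val fromDb = spark.read.jdbc(jdbcUrl, "source_table", props)
fromDb.write.mode(SaveMode.Append).jdbc(jdbcUrl, "target_table", props)

// Throughput is measured as rows transferred per second, timed around
// each write from the driver side.
{code}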
I have compared the application built against Spark 3.5.5 and against 4.0.0 in
cluster mode and in local mode, plus one comparison (using a synthetic file)
with spark-shell only.
With spark-shell the throughput was more or less the same for 3.5.5 and 4.0.0,
but for our application in cluster / local mode, both scenarios had better
throughput with 3.5.5.
I have observed that with 4.0.0 there are longer delays (compared with 3.5.5)
between the executor log lines
"Running task x in stage y"
and
"Finished task x in stage y".
Is this throughput degradation a known issue? Could it be related to
SPARK-48456 ([M1] Performance benchmark)?
(I'll also mention that we are using checkpointing, in case it might be
relevant here.)
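The way we use checkpointing is roughly as follows (a minimal sketch reusing
the hypothetical jdbcUrl/props from the sketch above; the checkpoint directory
and the exact place in the pipeline where checkpoint() is called are
illustrative):
{code:scala}
// Reliable checkpointing needs a checkpoint directory on the cluster's
// file system; the path here is a placeholder.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val source = spark.read.jdbc(jdbcUrl, "source_table", props)

// Dataset.checkpoint() is eager by default and truncates the lineage
// before the write.
val checkpointed = source.checkpoint()
checkpointed.write.mode(SaveMode.Append).jdbc(jdbcUrl, "target_table", props)
{code}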