Adrian Pusty created SPARK-54037:
------------------------------------

             Summary: Throughput deteriorated after migration from Spark 3.5.5 
to Spark 4.0.0
                 Key: SPARK-54037
                 URL: https://issues.apache.org/jira/browse/SPARK-54037
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Adrian Pusty


My team recently updated our Spark dependency from 3.5.5 to 4.0.0.
This included switching to spark-4.0.0-bin-hadoop3.tgz, updating the pom.xml 
files, and changing import statements (org.apache.spark.sql -> 
org.apache.spark.sql.classic).

After this change, our throughput (calculated as rows transferred per second) 
dropped significantly in both of our scenarios: 1. read from file, write to 
database and 2. read from database, write to database.

I compared the application built against Spark 3.5.5 and against 4.0.0 in 
cluster mode and in local mode, and ran one further comparison (using a 
synthetic file) with spark-shell only.
With spark-shell, throughput was more or less the same for 3.5.5 and 4.0.0. 
With our application, in both cluster mode and local mode, both scenarios had 
better throughput on 3.5.5.

I have observed that with 4.0.0 there are longer delays (when compared with 
3.5.5) between log lines
"Running task x in stage y"
and
"Finished task x in stage y".
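
The delay between those two lines can be measured per task by pairing them 
and subtracting timestamps. Below is a minimal Python sketch of that idea. It 
assumes the logs use Spark's default timestamp layout (yy/MM/dd HH:mm:ss) and 
that each line carries the "(TID n)" suffix; the function name task_durations 
and the sample line in the comment are illustrative, not taken from our logs.

```python
import re
from datetime import datetime

# Assumed line layout (Spark's default console pattern), for example:
#   25/10/27 10:15:02 INFO Executor: Running task 3.0 in stage 1.0 (TID 7)
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"(Running|Finished) task (\S+) in stage (\S+) \(TID (\d+)\)"
)


def task_durations(lines):
    """Map (stage, task) to seconds between its 'Running' and 'Finished' lines."""
    started = {}
    durations = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a task start/finish line, or a different layout
        ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
        key = (m.group(4), m.group(3))
        if m.group(2) == "Running":
            started[key] = ts
        elif key in started:
            # Only the first 'Finished' line after a 'Running' line is counted.
            durations[key] = (ts - started.pop(key)).total_seconds()
    return durations
```

Running this over the logs of the same job on 3.5.5 and on 4.0.0 and comparing 
the resulting values per stage would show which stages account for the extra 
time. Timestamps in this layout have one-second resolution, so very short 
tasks will show as 0.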

Is this throughput degradation a known issue? Could it be related to 
SPARK-48456 ([M1] Performance benchmark)?

I'll also mention that we are using checkpointing, in case that is relevant 
here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
