[ https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins resolved SPARK-45745.
-----------------------------------
    Resolution: Duplicate

> Extremely slow execution of sum of columns in Spark 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-45745
>                 URL: https://issues.apache.org/jira/browse/SPARK-45745
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>            Reporter: Javier
>            Priority: Major
>
> We are upgrading some PySpark jobs from Spark 3.1.2 to Spark 3.4.1, and code 
> that previously ran fine is now effectively never finishing, even for small 
> DataFrames.
> We have reduced the problematic piece of code to the minimal PySpark example 
> below, which shows the issue:
> {code:python}
> from pyspark.sql import functions as F
>
> # sql_context: pre-existing SQLContext / SparkSession handle
> n_cols = 50
> # 5 rows, each with columns col0..col49 holding the values 0..49
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> # Sum the 50 columns with Python's builtin sum over a list of Columns,
> # which expands into one deeply nested chain of "+" expressions
> df_data = df_data.withColumn(
>     "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False)
> {code}
> This code runs fine with Spark 3.1.2, but with 3.4.1 the computation time 
> explodes once `n_cols` exceeds roughly 25 columns. A colleague suggested it 
> could be related to the 22-element limit on tuples in Scala 2.13 
> (https://www.scala-lang.org/api/current/scala/Tuple22.html), since 25 columns 
> is suspiciously close to that limit. Is there a known defect in the logical 
> plan optimization in 3.4.1? Or is this kind of operation (summing many 
> columns) supposed to be implemented differently?
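> A minimal sketch of one alternative formulation we are considering (an 
> assumption on our side, not a confirmed workaround): pack the columns into a 
> single array and fold it with `F.aggregate` (available since Spark 3.1.0), so 
> the sum is a single higher-order expression instead of a deeply nested chain 
> of Column additions.
> {code:python}
> from pyspark.sql import functions as F
>
> # Hypothetical alternative: sum via one array aggregate rather than
> # chaining n_cols nested "+" expressions on Column objects.
> df_alt = df_data.withColumn(
>     "col_sum",
>     F.aggregate(
>         F.array(*[F.col(f"col{i}") for i in range(n_cols)]),  # all input columns as one array
>         F.lit(0).cast("long"),                                 # accumulator matching the inferred bigint columns
>         lambda acc, x: acc + x,                                # running sum over the array elements
>     ),
> )
> df_alt.show(10, False)
> {code}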



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
