[ https://issues.apache.org/jira/browse/SPARK-45745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce Robbins resolved SPARK-45745.
-----------------------------------
    Resolution: Duplicate

> Extremely slow execution of sum of columns in Spark 3.4.1
> ---------------------------------------------------------
>
>                 Key: SPARK-45745
>                 URL: https://issues.apache.org/jira/browse/SPARK-45745
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.4.1
>            Reporter: Javier
>            Priority: Major
>
> We are in the process of upgrading some PySpark jobs from Spark 3.1.2 to
> Spark 3.4.1, and code that previously ran fine now effectively never
> finishes, even for small DataFrames.
> We have reduced the problematic piece of code to the minimal PySpark
> example below:
> {code:python}
> n_cols = 50
> data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
> df_data = sql_context.createDataFrame(data)
> df_data = df_data.withColumn(
>     "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
> )
> df_data.show(10, False)
> {code}
> This code runs fine on Spark 3.1.2, but on 3.4.1 the computation time
> appears to explode once `n_cols` exceeds roughly 25 columns. A colleague
> suggested it could be related to the 22-element limit on tuples in Scala
> 2.13 (https://www.scala-lang.org/api/current/scala/Tuple22.html), since 25
> columns is suspiciously close to that limit. Is there a known defect in
> logical plan optimization in 3.4.1? Or is this kind of operation (summing
> multiple columns) supposed to be implemented differently?
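
A self-contained sketch of the reporter's reproduction, for running outside their environment. It assumes a local SparkSession in place of the reporter's `sql_context` and that `F` is `pyspark.sql.functions`; the reduce-based variant at the end is not part of the original report and is shown only as another way to write the same sum, not as a confirmed workaround for the slowdown.

{code:python}
from functools import reduce
from operator import add

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumption: a local SparkSession stands in for the reporter's `sql_context`.
spark = (
    SparkSession.builder.master("local[*]")
    .appName("SPARK-45745-repro")
    .getOrCreate()
)

n_cols = 50  # the reporter observes the blow-up once this exceeds roughly 25
data = [{f"col{i}": i for i in range(n_cols)} for _ in range(5)]
df_data = spark.createDataFrame(data)

# Reporter's formulation: Python's builtin sum() folds the Column objects
# into one left-nested chain of additions.
df_data.withColumn(
    "col_sum", sum([F.col(f"col{i}") for i in range(n_cols)])
).show(10, False)

# Equivalent formulation with functools.reduce over the same columns; it
# builds the same nested addition expression, so whether it behaves any
# better depends on the underlying cause tracked in the duplicated issue.
df_data.withColumn(
    "col_sum", reduce(add, [F.col(f"col{i}") for i in range(n_cols)])
).show(10, False)
{code}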