Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19480#discussion_r144688322 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -2103,4 +2103,35 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)), Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2))) } + + test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") { + val colNumber = 10000 + val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*)) + val df = sqlContext.createDataFrame(input, StructType( + (1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false)))) + val newCols = (1 to colNumber).flatMap { colIndex => + Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"), + expr(s"sqrt(_$colIndex)")) + } + df.select(newCols: _*).collect() + } + + test("SPARK-22226: too many splitted expressions should not exceed constant pool limit") { --- End diff -- You are right @viirya. Sorry, I didn't notice. Yes the problem is that most of the times we have both these issues at the moment, thus solving one is not enough. It turns out that there are some corner cases in which this fix is enough, like the real case I am working on. But it is not easy to reproduce them in a simple way. In this use case there are a lot of complex projections a `dropDuplicate` and some joins after that. But there are query made of thousands of lines of SQL code. The only way I have been able to reproduce it easily is in this test case: https://github.com/apache/spark/pull/19480/files#r144302922.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org