[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057713#comment-17057713 ]
Jungtaek Lim commented on SPARK-25987:
--------------------------------------

The root cause is how "flowAnalysis" in Janino works: [https://github.com/janino-compiler/janino/blob/ccb4931fd605ed8081839f962b57ac3734db78ee/janino/src/main/java/org/codehaus/janino/CodeContext.java] (The link points at v3.0.15 - flowAnalysis was replaced with a stack map in 3.1.x, but that change seems to introduce other bugs.)

Janino calls flowAnalysis to analyze and verify the stack size at each offset. It walks the offsets sequentially in a loop, but whenever it encounters a branch, it calls flowAnalysis recursively with the jump target offset. If the code is constructed like "if () { ... } if () { ... } if () { ... } ...", the call stack fills up with flowAnalysis frames even though the blocks are not nested. Janino does store the analyzed stack size for each offset to avoid processing the same offset multiple times, but that does not help in this case. So upgrading Janino to 3.0.x won't fix this issue.

In the meantime, adding "-Xss2m" makes the code pass on branch-2.4 without any new patch (and without a Janino upgrade). The generated code differs slightly between branch-2.4 and the master branch, especially in "processNext()", which seems to help avoid the issue, though it is not clear why.
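To illustrate the point about non-nested branches, here is a minimal, hypothetical sketch (not Janino's actual code; all names are illustrative) of a flow analysis that recurses on every branch target. Even though the N if-blocks are sequential rather than nested, the recursion depth grows linearly with N:

```scala
// Simplified model of a recursive flow analysis: a loop over offsets
// that recurses whenever it sees a branch instruction. N sequential
// (non-nested) branches still produce a call stack N frames deep.
object FlowAnalysisSketch {
  sealed trait Insn
  case object Plain extends Insn                   // ordinary instruction
  final case class Branch(target: Int) extends Insn // conditional jump

  // Returns the maximum recursion depth reached during "analysis".
  def flowAnalysis(code: Vector[Insn], offset: Int, depth: Int,
                   seen: scala.collection.mutable.Set[Int]): Int = {
    var maxDepth = depth
    var i = offset
    while (i < code.length) {
      code(i) match {
        case Branch(target) if !seen(target) =>
          seen += target
          // Recursive call for the jump target -- this is what fills
          // the stack when many branches appear in sequence.
          maxDepth = maxDepth.max(flowAnalysis(code, target, depth + 1, seen))
        case _ =>
      }
      i += 1
    }
    maxDepth
  }

  def main(args: Array[String]): Unit = {
    // Model N sequential if-blocks: Branch(i+2), Plain, Branch(i+4), ...
    val n = 1000
    val code: Vector[Insn] = Vector.tabulate(2 * n) { i =>
      if (i % 2 == 0) Branch(i + 2) else Plain
    }
    val depth = flowAnalysis(code, 0, 0, scala.collection.mutable.Set.empty)
    println(s"sequential branches: $n, recursion depth: $depth")
    // prints: sequential branches: 1000, recursion depth: 1000
  }
}
```

Caching per-offset results (the `seen` set above) avoids re-analyzing an offset, but it cannot shrink the depth of the first descent, which matches the observation that the memoization in CodeContext does not help here.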
> StackOverflowError when executing many operations on a table with many columns
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-25987
>                 URL: https://issues.apache.org/jira/browse/SPARK-25987
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5
>         Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181"
>            Reporter: Ivan Tsukanov
>            Priority: Major
>
> When I execute
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
>
> val columnsCount = 100
> val columns = (1 to columnsCount).map(i => s"col$i")
> val initialData = (1 to columnsCount).map(i => s"val$i")
>
> val df = spark.createDataFrame(
>   rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))),
>   schema = StructType(columns.map(StructField(_, StringType, true)))
> )
>
> val addSuffixUDF = udf(
>   (str: String) => str + "_added"
> )
>
> implicit class DFOps(df: DataFrame) {
>   def addSuffix() = {
>     df.select(columns.map(col =>
>       addSuffixUDF(df(col)).as(col)
>     ): _*)
>   }
> }
>
> df.addSuffix().addSuffix().addSuffix().show()
> {code}
> I get
> {code:java}
> An exception or error caused a run to abort.
> java.lang.StackOverflowError
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385)
>   at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553)
>   ...
> {code}
> If I reduce the number of columns (to 10, for example) or call `addSuffix` only once,
> it works fine.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)