[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174727#comment-16174727 ]
Marco Veluscek commented on SPARK-16845: ----------------------------------------

Hello, I have just encountered a similar issue when doing _except_ on two large dataframes. My code fails with an exception on Spark 2.1.0. The same code works on Spark 2.2.0, but logs several exceptions. Since I have to use 2.1.0 because of company policies, I would like to know whether there is a way to fix or work around this issue in 2.1.0.

Here are more details about the problem. On my company cluster I am running Spark 2.1.0.cloudera1 with Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112). The two dataframes have about 1 million rows and 467 columns. When I call _except_ {{dataframe1.except(dataframe2)}}, I get the following exception:

{code:title=Exception_with_2.1.0}
scheduler.TaskSetManager: Lost task 10.0 in stage 80.0 (TID 4146, cdhworker05.itec.lab, executor 4): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" grows beyond 64 KB
{code}

The logs then show the generated code for the class {{SpecificPredicate}}, which is more than 5000 lines long.
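One thing I am considering (only a sketch, not something I have confirmed helps: the failing class {{SpecificPredicate}} comes from expression-level codegen, which as far as I can tell this flag does not control) is turning off whole-stage code generation:

{code:title=disableWholeStageCodegen.scala}
// Hedged sketch: disable whole-stage codegen before running the query.
// The flag exists in Spark 2.x, but the 64 KB failure above comes from
// the separate expression codegen path that produces SpecificPredicate,
// so this switch may not be enough on 2.1.0.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}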
I wrote a small script to reproduce the error:

{code:title=testExcept.scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build a dataframe with `rows` rows and `cols` integer columns, all set to 1.
def start(rows: Int, cols: Int, spark: SparkSession) = {
  val data = (1 to rows).map(_ => Seq.fill(cols)(1))
  val colNames = (1 to cols).mkString(",")
  val sch = StructType(colNames.split(",").map(fieldName => StructField(fieldName, IntegerType, true)))
  val rdd = spark.sparkContext.parallelize(data.map(x => Row(x: _*)))
  spark.sqlContext.createDataFrame(rdd, sch)
}

val dataframe1 = start(1000, 500, spark)
val dataframe2 = start(1000, 500, spark)

val res = dataframe1.except(dataframe2)
res.count()
{code}

I have also tried a local Spark installation, version 2.2.0, with Scala 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131). With that version the code does not fail, but it logs several exceptions, all reporting the same error:

{code:title=Exception_with_2.2.0}
17/09/21 12:42:26 ERROR CodeGenerator: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" grows beyond 64 KB
{code}

The same generated code is then logged. In addition, this line is also logged several times:

{code}
17/09/21 12:46:20 WARN SortMergeJoinExec: Codegen disabled for this expression: (((((((((((((((((((((((((((((((((((((((...
{code}

Since I have to work with Spark 2.1.0, is there a way to work around this problem? Maybe by disabling code generation? Thank you for your help.
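One more workaround I have been considering (again only a sketch, not verified on our data): instead of letting _except_ generate one huge predicate over ~500 columns, compare a single hash computed over all columns. {{hash}} is {{org.apache.spark.sql.functions.hash}}, available since Spark 2.0. Note the caveats in the comments: hash collisions could in principle drop rows, and a left anti join keeps duplicates while _except_ deduplicates, so this only approximates _except_:

{code:title=exceptViaHash.scala}
import org.apache.spark.sql.functions.{col, hash}

// Hedged sketch: reduce the generated predicate to a single-column
// comparison by anti-joining on a hash of all columns.
// Caveats: hash collisions are theoretically possible, so this is an
// approximation of except(); it also keeps duplicate rows, whereas
// except() has set semantics and deduplicates.
val h1 = dataframe1.withColumn("__h", hash(dataframe1.columns.map(col): _*))
val h2 = dataframe2.withColumn("__h", hash(dataframe2.columns.map(col): _*))
val res2 = h1.join(h2, Seq("__h"), "left_anti").drop("__h")
res2.count()
{code}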
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
> grows beyond 64 KB
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: hejie
>            Assignee: Liwei Lin
>             Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>         Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the train data on all
> columns, the fatal error occurs.
> ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
> of class
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
> grows beyond 64 KB
> 	at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> 	at org.codehaus.janino.CodeContext.write(CodeContext.java:854)

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org