[ https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174727#comment-16174727 ]

Marco Veluscek commented on SPARK-16845:
----------------------------------------

Hello, 
I have just encountered a similar issue when doing _except_ on two large 
dataframes.
My code fails with an exception under Spark 2.1.0. The same code works under 
Spark 2.2.0, but it logs several exceptions. 
Since I have to work with 2.1.0 because of company policies, I would like to 
know whether there is a way to fix or work around this issue in 2.1.0.

Here are more details about the problem.
On my company cluster, I am working with Spark version 2.1.0.cloudera1 using 
Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112).

The two dataframes have about 1 million rows and 467 columns.
When I call {{dataframe1.except(dataframe2)}}, I get the following 
exception:
{code:title=Exception_with_2.1.0}
scheduler.TaskSetManager: Lost task 10.0 in stage 80.0 (TID 4146, 
cdhworker05.itec.lab, executor 4): java.util.concurrent.ExecutionException: 
java.lang.Exception: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
grows beyond 64 KB
{code}

Then the logs show the generated code for the class {{SpecificPredicate}}, 
which is more than 5000 lines long. As far as I understand, the JVM limits a 
single method's bytecode to 64 KB, and the equality predicate generated over 
all 467 columns exceeds that limit.

I wrote a small script to reproduce the error:
{code:title=testExcept.scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Build a dataframe with `rows` rows and `cols` integer columns, all set to 1.
// The columns are named "1", "2", ..., cols.
def start(rows: Int, cols: Int, spark: SparkSession) = {
  val data = (1 to rows).map(_ => Seq.fill(cols)(1))
  val sch = StructType((1 to cols).map(i => StructField(i.toString, IntegerType, true)))
  val rdd = spark.sparkContext.parallelize(data.map(x => Row(x: _*)))
  spark.sqlContext.createDataFrame(rdd, sch)
}

val dataframe1 = start(1000, 500, spark)
val dataframe2 = start(1000, 500, spark)

// except() generates a single equality predicate over all 500 columns,
// which is what pushes the generated eval() method past the 64 KB limit.
val res = dataframe1.except(dataframe2)

res.count()
{code}
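For reference, I run the script with {{spark-shell}}, so the {{spark}} session 
used above is already defined when it executes:
{code}
$ spark-shell -i testExcept.scala
{code}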

I have also tried with a local Spark installation, version 2.2.0, using Scala 
version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131).
With this Spark version the code does not fail, but it logs several 
exceptions, all with the same message:
{code:title=Exception_with_2.2.0}
17/09/21 12:42:26 ERROR CodeGenerator: failed to compile: 
org.codehaus.janino.JaninoRuntimeException: Code of method 
"eval(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
grows beyond 64 KB
{code}
Then the same generated code is logged.

In addition, this line is also logged several times:
{code}
17/09/21 12:46:20 WARN SortMergeJoinExec: Codegen disabled for this expression: 
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((...
{code}

Since the 2.2.0 run still completes, I assume Spark falls back to interpreted 
evaluation when compilation fails. However, as I have to work with Spark 
2.1.0, is there a way to work around this problem there, perhaps by disabling 
code generation?
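For example, I was thinking of something along these lines (an untested 
sketch; I do not know whether {{spark.sql.codegen.wholeStage}} also covers 
the expression-level codegen that produces {{SpecificPredicate}}):
{code:title=disableCodegen.scala}
// Untested sketch: turn off whole-stage code generation for this session.
// The failing SpecificPredicate comes from expression codegen, which this
// flag may not cover, so this is only a guess at a workaround.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

val res = dataframe1.except(dataframe2)
res.count()
{code}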

Thank you for your help.


> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16845
>                 URL: https://issues.apache.org/jira/browse/SPARK-16845
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: hejie
>            Assignee: Liwei Lin
>             Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>         Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on 
> all columns, the fatal error occurs. 
>       ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>       at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>       at org.codehaus.janino.CodeContext.write(CodeContext.java:854)


