[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103853#comment-16103853 ]
James Conner commented on SPARK-18016:
--------------------------------------

The issue does not appear to be completely fixed. I downloaded master today (last commit 2ff35a057efd36bd5c8a545a1ec3bc341432a904, Spark 2.3.0-SNAPSHOT) and attempted to perform a Gradient Boosted Tree (GBT) regression on a DataFrame containing 2658 columns. I am still getting the same error about the constant pool growing past the JVM limit of 0xFFFF. The steps I use to generate the error are:

{code:scala}
// Imports
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
import org.apache.spark.ml.feature.{VectorAssembler, Imputer, ImputerModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.sql.{SQLContext, Row, DataFrame, Column}

// Load data
val mainDF = spark.read.parquet("/path/to/data/input/main_pqt").repartition(1024)

// Impute missing values in the feature columns with the column mean
val inColsMain = mainDF.columns.filter(_ != "ID").filter(_ != "SCORE")
val outColsMain = inColsMain.map(i => i + "_imputed")
val mainImputer = new Imputer().setInputCols(inColsMain).setOutputCols(outColsMain).setStrategy("mean")
val mainImputerBuild = mainImputer.fit(mainDF)
val imputedMainDF = mainImputerBuild.transform(mainDF)

// Drop the original feature columns, retaining the imputed ones
val fullData = imputedMainDF.select(imputedMainDF.columns.filter(colName => !inColsMain.contains(colName)).map(colName => new Column(colName)): _*)

// Split data for training & testing
val Array(trainDF, testDF) = fullData.randomSplit(Array(0.80, 0.20), seed = 12345)

// Vector Assembler
val arrayName = fullData.columns.filter(_ != "ID").filter(_ != "SCORE")
val assembler = new VectorAssembler().setInputCols(arrayName).setOutputCol("features")

// GBT Object
val gbt = new GBTRegressor().setLabelCol("SCORE").setFeaturesCol("features").setMaxIter(5).setSeed(1993).setLossType("squared").setSubsamplingRate(1)

// Pipeline Object
val pipeline = new Pipeline().setStages(Array(assembler, gbt))

// Hyperparameter Grid Object
val paramGrid = new ParamGridBuilder().addGrid(gbt.maxBins, Array(2, 4, 8)).addGrid(gbt.maxDepth, Array(1, 2, 4)).addGrid(gbt.stepSize, Array(0.1, 0.2)).build()

// Evaluator Object
val evaluator = new RegressionEvaluator().setLabelCol("SCORE").setPredictionCol("prediction").setMetricName("rmse")

// Cross Validation Object
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(5)

// Build the model
val gbtModel = cv.fit(trainDF)
{code}

Upon building the model, it errors out with the following cause:

{code:java}
java.lang.RuntimeException: Error while encoding: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection$NestedClass2 has grown past JVM limit of 0xFFFF
{code}
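For anyone trying to reproduce this without the original data, the sketch below is a synthetic variant of the steps above. The column count, row count, and random values are assumptions (the original parquet input is not public), and it skips the imputation and cross-validation stages, which should not be needed to reach the failing code generation:

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.rand

// Assumption: ~2500 numeric columns approximates the 2658-column input.
val nCols = 2500
val featureNames = (1 to nCols).map(i => s"f$i")

// Build a wide DataFrame of random doubles plus a random label column.
val wideDF = spark.range(1000).select(
  (featureNames.map(c => rand().alias(c)) :+ rand().alias("SCORE")): _*)

val assembler = new VectorAssembler()
  .setInputCols(featureNames.toArray)
  .setOutputCol("features")

val gbt = new GBTRegressor()
  .setLabelCol("SCORE")
  .setFeaturesCol("features")
  .setMaxIter(5)

// Fitting over the assembled wide frame should exercise the same generated
// projection code that fails in the full pipeline.
val model = gbt.fit(assembler.transform(wideDF))
{code}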
> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -----------------------------------------------------------------
>
>                 Key: SPARK-18016
>                 URL: https://issues.apache.org/jira/browse/SPARK-18016
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Aleksander Eskilson
>            Assignee: Aleksander Eskilson
>             Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF
> 	at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
> 	at org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
> 	at org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
> 	at org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:11114)
> 	at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
> 	at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
> 	at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
> 	at org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
> 	at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
> 	at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
> 	at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
> 	at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
> 	at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
> 	at org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
> 	at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
> 	at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
> 	at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
> 	at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
> 	at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
> 	at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
> 	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
> 	at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
> 	at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
> 	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
> 	at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
> 	at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
> 	at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
> 	at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
> 	at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
> 	at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
> 	at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
> 	at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
> 	at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
> 	at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
> 	at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
> 	at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
> 	at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229)
> 	at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196)
> 	at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91)
> 	at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:905)
> 	... 35 more
> {code}
> During generation of the code for SpecificUnsafeProjection, all the mutable
> variables are declared up front. If there are too many, it seems it perhaps
> exceeds some type of resource limit.
> This issue seems related to (but is not fixed by) SPARK-17702, which itself
> was about the size of individual methods growing beyond the 64 KB limit.
> SPARK-17702 was resolved by breaking extractions into smaller methods, but
> does not seem to have resolved this issue.
> I've created a small project [1] where I declare a list of "wide" and
> "nested" Bean objects that I attempt to encode to a Dataset. This code can
> trigger the failure for Spark 2.1.0-SNAPSHOT.
> [1] - https://github.com/bdrillard/spark-codegen-error
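For reference, a compact Scala analogue of the wide-object encoding scenario described in the quoted issue (the linked project [1] uses Java Beans; the field count below is an assumption, chosen only to make the generated projection very wide):

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Assumption: a few thousand fields is enough to push the projection's
// constant pool past 0xFFFF; the exact threshold depends on the Spark version.
val nFields = 5000
val schema = StructType((1 to nFields).map(i => StructField(s"c$i", DoubleType)))
val rows = Seq.fill(4)(Row.fromSeq(Seq.fill(nFields)(0.0)))

// Creating a Dataset with an explicit RowEncoder forces code generation of a
// SpecificUnsafeProjection over the full schema (the class named in the error)
// when the rows are encoded.
val ds = spark.createDataset(rows)(RowEncoder(schema))
ds.collect()
{code}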