[ https://issues.apache.org/jira/browse/SPARK-17223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
K updated SPARK-17223:
----------------------

Description:

Hi everyone,

We have a dataset with ~500 columns. If I call a StringIndexer on it and try to print the first row, it fails with the "grows beyond 64 KB" error below. My original dataset had >20K rows; I stripped it down to 100 rows, but that didn't help. Eventually we want to feed the StringIndexer, a VectorAssembler, and a Random Forest into a Pipeline, but we are not having much luck here :( We tried 2.0.0 and 2.1.0 (snapshot as of 8/23). The problem is reproducible with the data file here:
https://drive.google.com/file/d/0B2zl8xCBUVh6TFZDd3ZSUTNsam8/view?usp=sharing

Environment: Cluster with 2 nodes (CentOS, 64 GB RAM and 8 cores each)

Code is here (JIRA corrupted it, so it was moved to a Google Doc):
https://docs.google.com/document/d/19unfhSMMCjoXqhmFOA1omm4V2wHaraY0RxZesbQluZU/edit?usp=sharing

ERROR:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.python.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 250.0 failed 4 times, most recent failure: Lost task 0.3 in stage 250.0 (TID 4666, ip): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method "compare(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

was: the same report, with the code inline instead of the Google Doc link:

Code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

k_temp7 = load_csv_file('spark_bug.csv')  # user-defined CSV loader

# Fit on the whole dataset to include all labels in the index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel") \
    .setHandleInvalid("skip") \
    .fit(k_temp7)

weights = [0.70, 0.15, 0.15]
seed = 42
df_train, df_validation, df_test = k_temp7.randomSplit(weights, seed)

# feature_assembler = VectorAssembler(inputCols=["SomeUnknownEmptyCategory"],
#                                     outputCol="train_features")

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="train_features",
                            predictionCol="prediction", numTrees=10)

pipeline = Pipeline(stages=[labelIndexer])  # , feature_assembler, rf])
model = pipeline.fit(df_train)

# Measure performance of the model on the validation dataset.
model_output = model.transform(df_train)
print(model_output.first())  # this fails
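For context: the 64 KB in the error is the JVM's per-method bytecode limit, and the failing method is the compare() that Catalyst code generation emits for GeneratedClass$SpecificOrdering, apparently for the per-partition sort that randomSplit performs to make splits deterministic. Below is a minimal self-contained sketch of the same code path, assuming any sufficiently wide DataFrame triggers it; the synthetic data, app name, and column names are placeholders for spark_bug.csv, not part of the original report:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("spark-17223-repro").getOrCreate()

# Synthetic stand-in for spark_bug.csv: 100 rows x 500 string columns.
n_cols = 500
col_names = ["label"] + ["c%d" % i for i in range(1, n_cols)]
rows = [tuple(str((r + c) % 3) for c in range(n_cols)) for r in range(100)]
df = spark.createDataFrame(rows, col_names)

labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel") \
    .setHandleInvalid("skip") \
    .fit(df)

# randomSplit sorts within each partition for determinism, so Catalyst
# generates a compare() over all ~500 columns -- presumably the
# SpecificOrdering method that grows beyond 64 KB.
df_train, df_validation, df_test = df.randomSplit([0.70, 0.15, 0.15], seed=42)

print(labelIndexer.transform(df_train).first())  # should reproduce the error on 2.0.0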
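If the ordering generated for randomSplit is indeed the culprit, one possible workaround (our assumption, not a confirmed fix) is to shrink the row before splitting so the generated compare() covers far fewer fields. Continuing the sketch above, with a placeholder subset of feature columns:

# Hypothetical workaround: keep only the columns the pipeline actually uses
# before calling randomSplit, so the generated ordering stays small.
needed_cols = ["label"] + col_names[1:20]  # placeholder for the real feature list
df_narrow = df.select(*needed_cols)
df_train, df_validation, df_test = df_narrow.randomSplit([0.70, 0.15, 0.15], seed=42)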
> "grows beyond 64 KB" with data frame with many columns
> ------------------------------------------------------
>
>                 Key: SPARK-17223
>                 URL: https://issues.apache.org/jira/browse/SPARK-17223
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: K