[ https://issues.apache.org/jira/browse/SPARK-24828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yuming Wang updated SPARK-24828: -------------------------------- Attachment: image-2018-07-18-13-57-21-148.png > Incompatible parquet formats - java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary > ------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: SPARK-24828 > URL: https://issues.apache.org/jira/browse/SPARK-24828 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.0 > Environment: Environment for creating the parquet file: > IBM Watson Studio Apache Spark Service, V2.1.2 > Environment for reading the parquet file: > java version "1.8.0_144" > Java(TM) SE Runtime Environment (build 1.8.0_144-b01) > Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode) > MacOSX 10.13.3 (17D47) > Spark spark-2.1.2-bin-hadoop2.7 directly obtained from > http://spark.apache.org/downloads.html > Reporter: Romeo Kienzer > Priority: Minor > Attachments: a2_m2.parquet.zip, image-2018-07-18-13-57-21-148.png > > > As requested by [~hyukjin.kwon] here a new issue - related issue can be found > here > > Using the attached parquet file from one Spark installation, reading it using > an installation directly obtained from > [http://spark.apache.org/downloads.html] yields to the following exception: > > 18/07/17 07:40:38 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 4) > scala.MatchError: [1.0,null] (of class > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator$$anonfun$1.apply(MulticlassClassificationEvaluator.scala:79) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 18/07/17 07:40:38 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > > The file is attached [^a2_m2.parquet.zip] > > The following code reproduces the error: > df = spark.read.parquet('a2_m2.parquet') > from pyspark.ml.evaluation import MulticlassClassificationEvaluator > binEval = MulticlassClassificationEvaluator().setMetricName("accuracy") > .setPredictionCol("prediction").setLabelCol("label") > accuracy = binEval.evaluate(df) -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org