[ https://issues.apache.org/jira/browse/SPARK-35461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348667#comment-17348667 ]
Chao Sun commented on SPARK-35461:
----------------------------------

Actually this also fails when turning off the vectorized reader:

{code}
Caused by: java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to class org.apache.spark.sql.catalyst.expressions.MutableInt (org.apache.spark.sql.catalyst.expressions.MutableLong and org.apache.spark.sql.catalyst.expressions.MutableInt are in unnamed module of loader 'app')
  at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setInt(SpecificInternalRow.scala:253)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:178)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:88)
  at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:297)
  at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
  at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
  at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229)
{code}

In this case parquet-mr is able to return the value, but Spark is not able to handle it: judging from the stack trace, the target row is built with a MutableLong field for the bigint read schema, while the converter still calls setInt for the physical INT32 column.

> Error when reading dictionary-encoded Parquet int column when read schema is bigint
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-35461
>                 URL: https://issues.apache.org/jira/browse/SPARK-35461
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Chao Sun
>            Priority: Major
>
> When reading a dictionary-encoded integer column from a Parquet file, and users specify the read schema to be bigint, Spark currently fails with the following exception:
> {code}
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>   at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
>   at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
>   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
> {code}
> To reproduce:
> {code}
> val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
> withParquetFile(data) { path =>
>   val readSchema = StructType(Seq(StructField("_1", LongType)))
>   spark.read.schema(readSchema).parquet(path).first()
> }
> {code}
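Note on the repro: withParquetFile appears to come from Spark's ParquetTest test trait, so the snippet above only runs inside the test harness. Below is a rough standalone sketch of the same scenario (the object name and temp path are made up for illustration; it assumes a local SparkSession) that exercises both reader paths via the spark.sql.parquet.enableVectorizedReader config:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object Spark35461Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("SPARK-35461").getOrCreate()
    import spark.implicits._

    // Heavily repeated values, so the Parquet writer dictionary-encodes column "_1".
    val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
    val path = java.nio.file.Files.createTempDirectory("spark-35461").resolve("data").toString
    data.toDF().write.parquet(path) // tuple columns default to "_1" (int) and "_2" (string)

    // Read the physical INT32 column back as bigint.
    val readSchema = StructType(Seq(StructField("_1", LongType)))

    for (vectorized <- Seq("true", "false")) {
      spark.conf.set("spark.sql.parquet.enableVectorizedReader", vectorized)
      // true  -> UnsupportedOperationException in ParquetDictionary.decodeToLong
      // false -> ClassCastException: MutableLong cannot be cast to MutableInt
      val result = scala.util.Try(spark.read.schema(readSchema).parquet(path).first())
      println(s"vectorized=$vectorized => $result")
    }

    spark.stop()
  }
}
{code}

Both runs should fail as described: the vectorized reader with the decodeToLong UnsupportedOperationException from the description, and the parquet-mr reader with the ClassCastException above.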