[ https://issues.apache.org/jira/browse/SPARK-35461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348678#comment-17348678 ]
Dongjoon Hyun commented on SPARK-35461:
---------------------------------------

For the record, Apache Spark file-based data sources have different capabilities; for example, we don't expect much capability from the TEXT data source. The Parquet data source has had this limitation for a long time.

> Error when reading dictionary-encoded Parquet int column when read schema is bigint
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-35461
>                 URL: https://issues.apache.org/jira/browse/SPARK-35461
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Chao Sun
>            Priority: Major
>
> When reading a dictionary-encoded integer column from a Parquet file, and users specify the read schema to be bigint, Spark currently fails with the following exception:
> {code}
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
> 	at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
> 	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
> 	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
> {code}
> To reproduce:
> {code}
> val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
> withParquetFile(data) { path =>
>   val readSchema = StructType(Seq(StructField("_1", LongType)))
>   spark.read.schema(readSchema).parquet(path).first()
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
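The stack trace above points at a common pattern in parquet-mr's dictionary classes: the base class throws UnsupportedOperationException for decode paths a concrete dictionary does not support, and an integer dictionary overrides only the int decode path, so a bigint (long) read schema falls through to the base class and throws. The following is a minimal standalone sketch of that pattern with simplified, hypothetical class bodies (mirroring, not copying, the real parquet-mr code):

```java
// Simplified, hypothetical mimic of the pattern behind the stack trace
// (not the actual parquet-mr classes).
abstract class Dictionary {
    // Unsupported decode paths throw rather than silently convert.
    public int decodeToInt(int id) {
        throw new UnsupportedOperationException(getClass().getSimpleName());
    }

    public long decodeToLong(int id) {
        throw new UnsupportedOperationException(getClass().getSimpleName());
    }
}

class PlainIntegerDictionary extends Dictionary {
    private final int[] values = {10, 20, 30};

    @Override
    public int decodeToInt(int id) {
        return values[id];
    }
    // decodeToLong is deliberately NOT overridden: a bigint read schema
    // routes through it and hits the base-class throw, as in the trace.
}

public class DictionarySketch {
    public static void main(String[] args) {
        Dictionary d = new PlainIntegerDictionary();
        System.out.println(d.decodeToInt(1)); // prints 20
        try {
            d.decodeToLong(1);
        } catch (UnsupportedOperationException e) {
            System.out.println("decodeToLong failed: " + e.getMessage());
        }
    }
}
```

Whether the int-to-long widening belongs in Spark's vectorized reader or in the dictionary itself is exactly the question this issue raises; the sketch only illustrates why the call fails today.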