[ https://issues.apache.org/jira/browse/SPARK-35461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348667#comment-17348667 ]

Chao Sun commented on SPARK-35461:
----------------------------------

Actually, this also fails when the vectorized reader is turned off:
{code}
Caused by: java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to class org.apache.spark.sql.catalyst.expressions.MutableInt (org.apache.spark.sql.catalyst.expressions.MutableLong and org.apache.spark.sql.catalyst.expressions.MutableInt are in unnamed module of loader 'app')
        at org.apache.spark.sql.catalyst.expressions.SpecificInternalRow.setInt(SpecificInternalRow.scala:253)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:178)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:88)
        at org.apache.parquet.column.impl.ColumnReaderBase$2$3.writeValue(ColumnReaderBase.java:297)
        at org.apache.parquet.column.impl.ColumnReaderBase.writeCurrentValueToConverter(ColumnReaderBase.java:440)
        at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:30)
        at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
        at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:229)
{code}
In this case parquet-mr is able to return the value, but Spark can't handle it: the row was allocated for the bigint read schema (MutableLong), while the converter still writes the file's int32 value via setInt.
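The non-vectorized failure can be sketched in miniature. This is a hypothetical, simplified mirror of the pattern in the stack trace, not the real Spark classes: the row allocates one mutable holder per field from the *read* schema (bigint gives MutableLong), while the Parquet converter dispatches on the *file* type (int32 gives setInt), and setInt casts the holder to MutableInt unconditionally.

```scala
// Hypothetical mirror of SpecificInternalRow / ParquetRowConverter$RowUpdater;
// same shape as the trace above, not the actual Spark implementation.
object CastMismatchSketch {
  sealed trait MutableValue
  final class MutableInt  extends MutableValue { var value: Int  = 0  }
  final class MutableLong extends MutableValue { var value: Long = 0L }

  final class SpecificRow(val values: Array[MutableValue]) {
    // Mirrors SpecificInternalRow.setInt: assumes the holder is a MutableInt.
    def setInt(ordinal: Int, v: Int): Unit =
      values(ordinal).asInstanceOf[MutableInt].value = v
  }

  def main(args: Array[String]): Unit = {
    // Read schema says bigint, so the row holds a MutableLong...
    val row = new SpecificRow(Array(new MutableLong))
    try {
      // ...but the file column is int32, so the converter calls setInt.
      row.setInt(0, 42)
      println("no error")
    } catch {
      case _: ClassCastException => println("ClassCastException") // as in the trace
    }
  }
}
```

The fix direction this suggests is that the updater (or converter) needs to widen int32 values to long when the read schema asks for bigint, rather than casting by the file-side type.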

> Error when reading dictionary-encoded Parquet int column when read schema is 
> bigint
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-35461
>                 URL: https://issues.apache.org/jira/browse/SPARK-35461
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Chao Sun
>            Priority: Major
>
> When reading a dictionary-encoded integer column from a Parquet file with the 
> read schema specified as bigint, Spark currently fails with the following 
> exception:
> {code}
> java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
>       at org.apache.parquet.column.Dictionary.decodeToLong(Dictionary.java:49)
>       at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:50)
>       at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>       at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:344)
> {code}
> To reproduce:
> {code}
>     val data = (0 to 10).flatMap(n => Seq.fill(10)(n)).map(i => (i, i.toString))
>     withParquetFile(data) { path =>
>       val readSchema = StructType(Seq(StructField("_1", LongType)))
>       spark.read.schema(readSchema).parquet(path).first()
>     }
> {code}
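For context on the vectorized-path trace in the description: parquet-mr's abstract Dictionary base class implements each decode method to throw UnsupportedOperationException naming the concrete class, and an integer dictionary overrides only the int accessor, so asking it for a long hits the base-class default. A simplified sketch of that pattern (hypothetical names, not the real parquet-mr classes):

```scala
// Hypothetical mirror of parquet-mr's Dictionary pattern: the base class
// throws for every decode method, and each concrete dictionary overrides
// only the method matching its physical type.
object DictionarySketch {
  abstract class Dictionary {
    def decodeToInt(id: Int): Int =
      throw new UnsupportedOperationException(getClass.getName)
    def decodeToLong(id: Int): Long =
      throw new UnsupportedOperationException(getClass.getName)
  }

  // Only the int accessor is overridden, as in PlainIntegerDictionary.
  final class PlainIntDictionary(values: Array[Int]) extends Dictionary {
    override def decodeToInt(id: Int): Int = values(id)
  }

  def main(args: Array[String]): Unit = {
    val dict = new PlainIntDictionary(Array(7, 8, 9))
    println(dict.decodeToInt(1)) // decoding by the file's type works
    try println(dict.decodeToLong(1)) // bigint read schema asks for a long
    catch {
      case _: UnsupportedOperationException =>
        println("UnsupportedOperationException") // base-class default
    }
  }
}
```

Under this reading, a fix on the Spark side would widen the dictionary-decoded int to long itself (e.g. in ParquetDictionary or the column vector) instead of delegating to a decodeToLong the dictionary never implements.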



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
