[ https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gengliang Wang reassigned SPARK-39926:
--------------------------------------

    Assignee: Daniel

> Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-39926
>                 URL: https://issues.apache.org/jira/browse/SPARK-39926
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
>
> How to reproduce:
> {code:sql}
> set spark.sql.parquet.enableVectorizedReader=false;
> create table t(a int) using parquet;
> insert into t values (42);
> alter table t add column b int default 42;
> insert into t values (43, null);
> select * from t;
> {code}
> This should return two rows:
> (42, 42) and (43, NULL)
> But instead the scan misses the inserted NULL value and returns the
> existence DEFAULT value of "42" in its place:
> (42, 42) and (43, 42)
>
> This bug happens because the Parquet API calls one of these set* methods in
> ParquetRowConverter.scala whenever it finds a non-NULL value:
> {code:scala}
> private class RowUpdater(row: InternalRow, ordinal: Int)
>   extends ParentContainerUpdater {
>   override def set(value: Any): Unit = row(ordinal) = value
>   override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
>   override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
>   override def setShort(value: Short): Unit = row.setShort(ordinal, value)
>   override def setInt(value: Int): Unit = row.setInt(ordinal, value)
>   override def setLong(value: Long): Unit = row.setLong(ordinal, value)
>   override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
>   override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
> }
> {code}
>
> But it never calls anything like "setNull()" when encountering a NULL value,
> so the reader cannot tell "column present but NULL" apart from "column
> absent from the file".
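To make the needed distinction concrete, here is a minimal, self-contained sketch of the idea (hypothetical names like TrackedRow and applyExistenceDefaults; this is not Spark's actual code): track which ordinals the converter wrote, then fill existence DEFAULTs only into fields that were never written.

```scala
// Hypothetical sketch, not Spark's implementation: a row that records which
// fields the Parquet converter actually wrote, so an untouched field (column
// missing from the file) can receive its existence DEFAULT, while a field
// that was explicitly set to NULL stays NULL.
object ExistenceDefaultSketch {
  // Stand-in for Spark's InternalRow: values plus a "was written" bit per field.
  final class TrackedRow(numFields: Int) {
    val values = new Array[Any](numFields)
    val wasSet = new Array[Boolean](numFields)
    def setInt(ordinal: Int, v: Int): Unit = { values(ordinal) = v; wasSet(ordinal) = true }
    def setNull(ordinal: Int): Unit = { values(ordinal) = null; wasSet(ordinal) = true }
  }

  // After converting one Parquet record, apply existence DEFAULT values only
  // to the fields the converter never wrote.
  def applyExistenceDefaults(row: TrackedRow, defaults: Array[Any]): Unit = {
    var i = 0
    while (i < row.values.length) {
      if (!row.wasSet(i)) row.values(i) = defaults(i)
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val defaults = Array[Any](null, 42) // column b has existence DEFAULT 42

    // Row written before column b existed: only column a is in the file.
    val oldRow = new TrackedRow(2)
    oldRow.setInt(0, 42)
    applyExistenceDefaults(oldRow, defaults)
    println(oldRow.values.mkString(",")) // 42,42

    // Row written after the ALTER: column b is present and explicitly NULL.
    val newRow = new TrackedRow(2)
    newRow.setInt(0, 43)
    newRow.setNull(1)
    applyExistenceDefaults(newRow, defaults)
    println(newRow.values.mkString(",")) // 43,null
  }
}
```

With this bookkeeping, the repro above would correctly return (42, 42) and (43, NULL).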
> To fix the bug, we need to know how many columns of data were present in each
> row of the Parquet data, so we can differentiate between a NULL value and a
> missing column.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org