[ 
https://issues.apache.org/jira/browse/SPARK-39926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-39926:
--------------------------------------

    Assignee: Daniel

> Fix bug in existence DEFAULT value lookups for non-vectorized Parquet scans
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-39926
>                 URL: https://issues.apache.org/jira/browse/SPARK-39926
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Daniel
>            Priority: Major
>
> How to reproduce:
> {code:sql}
> set spark.sql.parquet.enableVectorizedReader=false;
> create table t(a int) using parquet;
> insert into t values (42);
> alter table t add column b int default 42;
> insert into t values (43, null);
> select * from t;
> {code}
> This should return two rows:
> (42, 42) and (43, NULL).
> Instead, the scan misses the inserted NULL value and substitutes the
> existence DEFAULT value of 42, returning:
> (42, 42) and (43, 42).
>  
> This bug happens because the Parquet API calls one of these set* methods in 
> ParquetRowConverter.scala whenever it finds a non-NULL value:
> {code:scala}
> private class RowUpdater(row: InternalRow, ordinal: Int)
>     extends ParentContainerUpdater {
>   override def set(value: Any): Unit = row(ordinal) = value
>   override def setBoolean(value: Boolean): Unit = row.setBoolean(ordinal, value)
>   override def setByte(value: Byte): Unit = row.setByte(ordinal, value)
>   override def setShort(value: Short): Unit = row.setShort(ordinal, value)
>   override def setInt(value: Int): Unit = row.setInt(ordinal, value)
>   override def setLong(value: Long): Unit = row.setLong(ordinal, value)
>   override def setDouble(value: Double): Unit = row.setDouble(ordinal, value)
>   override def setFloat(value: Float): Unit = row.setFloat(ordinal, value)
> }
> {code}
>  
> But it never calls anything like "setNull()" when encountering a NULL value.
> To fix the bug, we need to know how many columns were actually present in 
> each row of the Parquet file, so we can distinguish a stored NULL value 
> (which should stay NULL) from a column that is missing from the file 
> entirely (which should receive the existence DEFAULT).
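>  
> A minimal, self-contained sketch of that idea (all names here are invented 
> for illustration; this is not the actual Spark patch): record which ordinals 
> received a set* call while converting the current row, then treat unset 
> ordinals that are present in the file schema as NULL, applying the existence 
> DEFAULT only to ordinals genuinely absent from the file:
> {code:scala}
> // Hypothetical sketch, not Spark's real converter: Array[Any] stands in
> // for InternalRow so the example runs on its own.
> object NullVsMissingSketch {
>   def finishRow(
>       row: Array[Any],              // the row buffer being materialized
>       written: Array[Boolean],      // did a set* call arrive for this ordinal?
>       inFileSchema: Array[Boolean], // is the column present in the Parquet file?
>       existenceDefaults: Array[Any]): Unit = {
>     var i = 0
>     while (i < row.length) {
>       if (!written(i)) {
>         // Present in the file but never set => the stored value was NULL.
>         // Absent from the file entirely => fill in the existence DEFAULT.
>         row(i) = if (inFileSchema(i)) null else existenceDefaults(i)
>       }
>       written(i) = false // reset the tracker for the next row
>       i += 1
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Simulates the second inserted row (43, NULL): column b exists in the
>     // file, but no setInt call arrives because the stored value is NULL,
>     // so the slot still holds a stale default of 42 before finishRow runs.
>     val row = Array[Any](43, 42)
>     val written = Array(true, false)
>     finishRow(row, written,
>       inFileSchema = Array(true, true),
>       existenceDefaults = Array[Any](42, 42))
>     println(row.mkString("(", ", ", ")")) // prints (43, null), not (43, 42)
>   }
> }
> {code}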



