Ala Luszczak created SPARK-40059:
------------------------------------

             Summary: Row indexes can overshadow user-created data
                 Key: SPARK-40059
                 URL: https://issues.apache.org/jira/browse/SPARK-40059
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Ala Luszczak

https://github.com/apache/spark/pull/37228 introduces the ability to compute row indexes, which users can access through the `_metadata.row_index` column. Internally, this is achieved with the help of an extra column, `_tmp_metadata_row_index`. When this column is present in the schema sent to the Parquet reader, the reader populates it with row indexes, and the values are later placed in the `_metadata` struct.

While relatively unlikely, it is still possible that a user might want to include a column named `_tmp_metadata_row_index` in their own data. In that scenario, the column will be populated with row indexes rather than with the data read from the file.

For a repro, search `FileMetadataStructRowIndexSuite.scala` for this Jira ticket number.

We could introduce some kind of countermeasure to handle this scenario.
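
To make the collision concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this ticket, not Spark's reader code: `read_rows`, `has_name_conflict`, and the sample data are hypothetical names invented for illustration, and the only identifier taken from the ticket is the internal column name `_tmp_metadata_row_index`.

```python
# Toy model of the name collision; this is NOT Spark's Parquet reader.
# Only the column name below comes from the ticket; everything else is hypothetical.
TMP_ROW_INDEX_COL = "_tmp_metadata_row_index"


def read_rows(file_rows, requested_columns):
    """Simulate a reader that fills TMP_ROW_INDEX_COL with each row's position."""
    result = []
    for row_index, file_row in enumerate(file_rows):
        out = {}
        for col in requested_columns:
            if col == TMP_ROW_INDEX_COL:
                # The name is treated as a request for row indexes, so any
                # value stored in the file under the same name is ignored.
                out[col] = row_index
            else:
                out[col] = file_row.get(col)
        result.append(out)
    return result


def has_name_conflict(file_columns):
    """One conceivable check (an assumption, not a proposed fix):
    report whether the file itself already uses the internal name."""
    return TMP_ROW_INDEX_COL in file_columns


# Hypothetical user data that happens to contain the internal column name.
user_file = [
    {"id": 10, TMP_ROW_INDEX_COL: 500},
    {"id": 11, TMP_ROW_INDEX_COL: 600},
]

rows = read_rows(user_file, ["id", TMP_ROW_INDEX_COL])
ids = [r["id"] for r in rows]
shadowed = [r[TMP_ROW_INDEX_COL] for r in rows]
conflict = has_name_conflict(user_file[0].keys())

print(ids)       # ordinary columns come back unchanged
print(shadowed)  # the user's 500 and 600 are replaced by row positions
print(conflict)
```

In this model the user's stored values never reach the output, which mirrors the shadowing reported above. Whether a real countermeasure should detect the conflict, rename the internal column, or do something else is left open by the ticket.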