[ https://issues.apache.org/jira/browse/SPARK-47731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thang Long Vu updated SPARK-47731:
----------------------------------
    Description:
The Parquet reader in Spark has a bug: a file with more than 2 billion (2b+) rows in a single row group overflows the `Integer` range. This prevents Delta Parquet readers from exposing the row_index field as a metadata field. Fixing this would allow 2b+ rows in a single row group and would let Delta Parquet readers safely use the row_index field for any functionality that depends on it.

Link to the comment in the code:
https://github.com/delta-io/delta/blob/e3a481bd6c42a4f91686377d78ec9d9c934e27ee/spark/src/main/scala/org/apache/spark/sql/delta/DeltaParquetFileFormat.scala#L200

  was:
The Parquet reader in Spark has a bug: a file with more than 2 billion (2b+) rows in a single row group overflows the `Integer` range. This prevents Delta Parquet readers from exposing the row_index field as a metadata field. Fixing this would allow 2b+ rows in a single row group and would let Delta Parquet readers safely use the row_index field for any functionality that depends on it.

> Fix the 2b+ rows in a single rowgroup for row_index in Parquet reader
> ---------------------------------------------------------------------
>
>                 Key: SPARK-47731
>                 URL: https://issues.apache.org/jira/browse/SPARK-47731
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Thang Long Vu
>            Priority: Major

--
This message was sent by Atlassian Jira (v8.20.10#820010)
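
For illustration, the `Integer` range problem described in this issue can be sketched in plain Java. This is not the Spark or Delta reader code: the class and method names (`RowIndexOverflow`, `asIntIndex`, `asLongIndex`) are hypothetical, and the sketch only assumes that a row position inside a row group ends up stored in a 32-bit `int` somewhere along the way.

```java
// Illustrative sketch only; not taken from the Spark or Delta code base.
class RowIndexOverflow {

    // Hypothetical helper: a row position narrowed to a 32-bit int.
    // The cast silently wraps once the value exceeds Integer.MAX_VALUE.
    static int asIntIndex(long rowPosition) {
        return (int) rowPosition;
    }

    // Hypothetical helper: the same position kept as a 64-bit long.
    static long asLongIndex(long rowPosition) {
        return rowPosition;
    }

    public static void main(String[] args) {
        // A row position past the 2b mark within a single row group.
        long rowPosition = 2_200_000_000L;

        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("as int  = " + asIntIndex(rowPosition));
        System.out.println("as long = " + asLongIndex(rowPosition));
    }
}
```

On Java 11 or later this can be run directly with `java RowIndexOverflow.java`. The `int` value comes out negative, while the `long` value is preserved, which is why a 64-bit row index is needed before row_index can be relied on for such files.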