[jira] [Updated] (SPARK-47731) Fix the 2b+ rows in a single rowgroup for row_index in Parquet reader

Thang Long Vu (Jira) Thu, 04 Apr 2024 08:30:35 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-47731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thang Long Vu updated SPARK-47731:
----------------------------------
    Description: 
The Parquet reader in Spark has a bug: a file containing 2b+ (more than 
2 billion) rows in a single rowgroup makes the reader exceed the `Integer` 
range (maximum 2,147,483,647). This prevents the Delta Parquet readers from 
exposing the row_index field as a metadata field.
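
As a rough illustration of the kind of overflow described above, here is a 
minimal Java sketch. It is not the Spark or Delta code: the class and method 
names are hypothetical, and it only assumes that a row position is tracked in 
a 32-bit `int` rather than a 64-bit `long`.

```java
// Hypothetical sketch: NOT the Spark/Delta implementation. It only shows how a
// row index kept in a 32-bit int behaves once it passes Integer.MAX_VALUE.
final class RowIndexOverflow {

    // Assumed 32-bit counter: incrementing past Integer.MAX_VALUE wraps silently.
    static int nextRowIndexInt(int current) {
        return current + 1;
    }

    // 64-bit counter: has room for row indices well beyond 2 billion.
    static long nextRowIndexLong(long current) {
        return current + 1L;
    }

    public static void main(String[] args) {
        int lastIntIndex = Integer.MAX_VALUE; // 2,147,483,647

        // The int counter wraps around to a negative row index.
        System.out.println("int  next index: " + nextRowIndexInt(lastIntIndex));

        // The long counter continues with the next valid row index.
        System.out.println("long next index: " + nextRowIndexLong(lastIntIndex));

        // Math.toIntExact throws instead of wrapping when a long does not fit.
        try {
            Math.toIntExact(nextRowIndexLong(lastIntIndex));
        } catch (ArithmeticException e) {
            System.out.println("narrowing to int rejected: " + e.getMessage());
        }
    }
}
```

With an `int` counter the index after 2,147,483,647 wraps to a negative 
value, while a `long` counter keeps counting. Whether the actual fix widens 
a counter in this way is an assumption and is not stated in this ticket.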

Fixing this would allow 2b+ rows in a single rowgroup and would let the 
Delta Parquet readers safely use the row_index field for any functionality 
that depends on it.

Link to the comment in the code: 
https://github.com/delta-io/delta/blob/e3a481bd6c42a4f91686377d78ec9d9c934e27ee/spark/src/main/scala/org/apache/spark/sql/delta/DeltaParquetFileFormat.scala#L200

> Fix the 2b+ rows in a single rowgroup for row_index in Parquet reader
> ---------------------------------------------------------------------
>
>                 Key: SPARK-47731
>                 URL: https://issues.apache.org/jira/browse/SPARK-47731
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Thang Long Vu
>            Priority: Major


