[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553425#comment-17553425 ]

Gidon Gershinsky commented on PARQUET-2117:
-------------------------------------------

[~sha...@uber.com] Could you add [~prakharjain09] to the Parquet contributors?

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Prakhar Jain
> Priority: Major
> Fix For: 1.12.3
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a
> Parquet file in columnar fashion or record-by-record.
> It would be great to extend them to also support a rowPosition API which can
> tell the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a Parquet file/Parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the Parquet ecosystem which can benefit from
> such functionality:
> # Apache Iceberg needs this functionality. It already has an implementation,
> as it relies on low-level Parquet APIs -
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
> # Apache Spark can use this functionality - SPARK-37980

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
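The issue description suggests using the row position as a row identifier when building an index over a file. A minimal sketch of that idea follows. It is not part of parquet-mr: the class and method names are hypothetical, a `TreeMap` stands in for the B+ tree, and an in-memory list of keys stands in for the key column of the records as they are read in file order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: index key values by the position of the row that holds them. */
final class RowPositionIndexSketch {

  /**
   * Builds a sorted index from key to row positions.
   * Assumption: keys.get(i) is the key of the record at row position i in the file.
   */
  static TreeMap<String, List<Long>> buildIndex(List<String> keys) {
    TreeMap<String, List<Long>> index = new TreeMap<>();
    long rowPosition = 0; // position of the current record in the file
    for (String key : keys) {
      index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowPosition);
      rowPosition++;
    }
    return index;
  }

  /** Returns the row positions recorded for the key, or an empty array if there are none. */
  static long[] lookup(TreeMap<String, List<Long>> index, String key) {
    List<Long> positions = index.get(key);
    if (positions == null) {
      return new long[0];
    }
    return positions.stream().mapToLong(Long::longValue).toArray();
  }
}
```

With a real reader, the counter would be replaced by whatever row position the reader reports, so that rows skipped by a filter do not shift the recorded positions.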
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551740#comment-17551740 ]

Prakhar Jain commented on PARQUET-2117:
---------------------------------------

Resolving this issue as the PR is merged. [~gershinsky] Could you reassign the Jira to me?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527900#comment-17527900 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

ggershinsky commented on PR #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1109371055

@prakharjain09 hopefully, we'll resolve the remaining issues at the community sync tomorrow, and start working on a cut.

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Prakhar Jain
> Priority: Major
> Fix For: 1.13.0
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527760#comment-17527760 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on PR #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1109035222

@ggershinsky Is there a tentative date or rough estimate for when we plan to cut the RC for the next release?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513174#comment-17513174 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1080253613

@prakharjain09 the upcoming parquet release will include the current master (plus a couple of WIP PRs, once they are merged), so this patch will be covered.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513123#comment-17513123 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1080148066

@shangxinli @ggershinsky Thanks a lot for reviewing this change. This will unblock SPARK-37980 if it is released as part of the upcoming Parquet release. Do we need to cherry-pick this to a release branch for that?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509355#comment-17509355 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli merged pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509176#comment-17509176 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1072953219

Thanks @ggershinsky for the review. I have addressed the comments and fixed the build issue.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509174#comment-17509174 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r830445563

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +275,46 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the row index of the current row. If no row has been processed or if the
+   * row index information is unavailable from the underlying @{@link PageReadStore}, returns -1.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L || rowIdxInFileItr == null) {
+      return -1;
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        rowIdxInRowGroupItr = LongStream.range(0, pages.getRowCount()).iterator();
+      }
+      // Adjust the row group offset in the `rowIndexWithinRowGroupIterator` iterator.
+      this.rowIdxInFileItr = new PrimitiveIterator.OfLong() {
+        public long nextLong() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.nextLong();
+        }
+
+        public boolean hasNext() {
+          return rowIdxInRowGroupItr.hasNext();
+        }
+
+        public Long next() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.next();
+        }
+      };
+    } else {

Review comment:
done.
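The hunk above adds a per-row-group offset to each row index within the group. As a rough illustration of where such an offset could come from, the sketch below assumes that a row group's offset equals the total row count of the row groups before it in the file. That assumption, and the class and method names, are illustrative only and are not taken from the PR.

```java
/** Hypothetical sketch: derive each row group's starting row index from the row counts. */
final class RowGroupOffsetSketch {

  /**
   * Returns, for every row group, the file-level index of its first row.
   * Assumption: row groups are stored back to back, so the offset of group i
   * is the sum of the row counts of groups 0..i-1.
   */
  static long[] rowIndexOffsets(long[] rowCounts) {
    long[] offsets = new long[rowCounts.length];
    long rowsSoFar = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      offsets[i] = rowsSoFar;
      rowsSoFar += rowCounts[i];
    }
    return offsets;
  }
}
```

Under that assumption, the third row (index 2 within its group) of a row group whose offset is 3 has file-level row index 5.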
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506940#comment-17506940 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1068061762

@prakharjain09 After you fix the CI failures, we can merge.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506762#comment-17506762 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1067671513

Thanks for this change. The PR looks good to me now; I'll add my approval after it passes the CI tests.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505405#comment-17505405 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

ggershinsky commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r825408797

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +275,46 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the row index of the current row. If no row has been processed or if the
+   * row index information is unavailable from the underlying @{@link PageReadStore}, returns -1.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L || rowIdxInFileItr == null) {
+      return -1;
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        rowIdxInRowGroupItr = LongStream.range(0, pages.getRowCount()).iterator();
+      }
+      // Adjust the row group offset in the `rowIndexWithinRowGroupIterator` iterator.
+      this.rowIdxInFileItr = new PrimitiveIterator.OfLong() {
+        public long nextLong() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.nextLong();
+        }
+
+        public boolean hasNext() {
+          return rowIdxInRowGroupItr.hasNext();
+        }
+
+        public Long next() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.next();
+        }
+      };
+    } else {

Review comment:
nit: could you start the method by checking this condition (!rowGroupRowIdxOffset.isPresent()) and returning early? It will look cleaner.
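The early-return shape requested in the review comment can be sketched on its own, outside the reader class. The sketch below is not the merged code: the class, method, and parameter names are hypothetical, and plain `Optional` arguments stand in for what the diff reads from `PageReadStore`. Returning `null` when the offset is absent mirrors the `rowIdxInFileItr == null` check in `getCurrentRowIndex()` in the diff, but is an assumption about what the `else` branch does.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical sketch of the guard-clause version of the row index iterator setup. */
final class RowIndexIteratorSketch {

  /**
   * Returns an iterator over file-level row indexes for one row group,
   * or null when the row group offset is unknown.
   */
  static PrimitiveIterator.OfLong fileRowIndexes(
      Optional<Long> rowGroupOffset,
      Optional<PrimitiveIterator.OfLong> rowIndexesInGroup,
      long rowCount) {
    // Guard clause first, as the review suggests: nothing to do without an offset.
    if (!rowGroupOffset.isPresent()) {
      return null;
    }
    final long offset = rowGroupOffset.get();
    // Use the selected row indexes if given, otherwise every row 0..rowCount-1.
    final PrimitiveIterator.OfLong inGroup =
        rowIndexesInGroup.orElseGet(() -> LongStream.range(0, rowCount).iterator());
    return new PrimitiveIterator.OfLong() {
      @Override
      public long nextLong() {
        return offset + inGroup.nextLong();
      }

      @Override
      public boolean hasNext() {
        return inGroup.hasNext();
      }
    };
  }

  /** Collects the remaining values of an iterator into an array (helper for inspection). */
  static long[] drain(PrimitiveIterator.OfLong it) {
    LongStream.Builder values = LongStream.builder();
    while (it.hasNext()) {
      values.add(it.nextLong());
    }
    return values.build().toArray();
  }
}
```

`PrimitiveIterator.OfLong` already provides a default boxed `next()` that delegates to `nextLong()`, so the sketch overrides only the two methods it needs.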
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502820#comment-17502820 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1061584498

@shangxinli Thanks for taking another look. I have addressed all comments except [one](https://github.com/apache/parquet-mr/pull/945#discussion_r820928524). Please advise on that one. Thanks!
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502719#comment-17502719 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1061409993

Hi guys, I'm OOO (vacation) this week. I can review it next week if that helps, but feel free to go ahead without waiting for me.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502447#comment-17502447 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820930501 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java ## @@ -140,6 +140,16 @@ public T read() throws IOException { } } + /** + * Returns the row index of the last read row. If no row has been processed, returns -1. Review comment: fixed.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502445#comment-17502445 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820928524 ## File path: parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReader.java ## @@ -46,10 +47,19 @@ private static final Path FILE_V1 = createTempFile(); private static final Path FILE_V2 = createTempFile(); - private static final List DATA = Collections.unmodifiableList(makeUsers(1)); + private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet"); Review comment: @shangxinli It looks like the [column-indexes](https://issues.apache.org/jira/browse/PARQUET-1201) are always written in the current version of parquet and are not configurable. We are already testing the new row index support with and without the column index filtering being triggered (as part of TestColumnIndexFiltering). Also the new row index feature doesn't rely on column indexes in any way. So we can skip the backward compatibility testing and remove this parquet file from resources. What do you think about this?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502428#comment-17502428 ] ASF GitHub Bot commented on PARQUET-2117: - shangxinli edited a comment on pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060907004 I just left some comments. Other than that, it looks good to me. Add @ggershinsky in case you have time to have a look. Beyond this PR, if the work you are doing in Iceberg/Spark can be done in Parquet, please consider adding them to Parquet-mr. With that, it can benefit all the applications that need parquet-mr.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502423#comment-17502423 ] ASF GitHub Bot commented on PARQUET-2117: - shangxinli commented on pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060907004 I just left some comments. Other than that, it looks good to me. Add @ggershinsky in case you have time to have a look.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502422#comment-17502422 ] ASF GitHub Bot commented on PARQUET-2117: - shangxinli commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820904662 ## File path: parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReader.java ## @@ -46,10 +47,19 @@ private static final Path FILE_V1 = createTempFile(); private static final Path FILE_V2 = createTempFile(); - private static final List DATA = Collections.unmodifiableList(makeUsers(1)); + private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet"); Review comment: I am not sure if it is a good idea to check in a data file. Can you check if it is possible to stop generating offset index in the current version of Parquet?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502413#comment-17502413 ] ASF GitHub Bot commented on PARQUET-2117: - shangxinli commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820893950 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java ## @@ -140,6 +140,16 @@ public T read() throws IOException { } } + /** + * Returns the row index of the last read row. If no row has been processed, returns -1. Review comment: Given this is a public method, we need to take care of the Java doc decorations. Please refer to other methods in this class and follow the same.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502411#comment-17502411 ] ASF GitHub Bot commented on PARQUET-2117: - shangxinli commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820893950 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java ## @@ -140,6 +140,16 @@ public T read() throws IOException { } } + /** + * Returns the row index of the last read row. If no row has been processed, returns -1. Review comment: Given this is a public method, we need to take care of the Java doc decorations.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502159#comment-17502159 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060365981 > Could you please look into the PR again? > Also could you share information about when are we planning to do code-freeze for next minor release? It will be great if we can release this change in next minor/patch release so that Apache Spark/other projects get to use this functionality sooner. @shangxinli Gentle reminder. Please take a look when you get chance.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499689#comment-17499689 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1055744409 @shangxinli Thanks a lot for the review. I have addressed the review comments. Could you please look into the PR again? Also could you share information about when are we planning to do code-freeze for next minor release? It will be great if we can release this change in next minor/patch release so that Apache Spark/other projects get to use this functionality sooner.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499687#comment-17499687 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 edited a comment on pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205 @shangxinli Thanks a lot for the review. I have addressed most of the comments. > We need more test to cover old parquet data that doesn't have column index. I couldn't find any existing tests or any existing parquet files in the resources directory that don't have column indexes. Could you please give some pointers to a similar existing test, or to some way to create a parquet file without column indexes (I don't see any option to disable writing column indexes either)? I have added tests validating that row indexes are correct with "column index filtering" disabled.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498261#comment-17498261 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815008882 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ## @@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException { return Collections.unmodifiableMap(setMultiMap); } + /** + * Returns the ROW_INDEX of the current row. + */ + public long getCurrentRowIndex() { +if (current == 0L) { Review comment: There was no specific reason for choosing exception over -1. I have updated it to return -1 and also updated all the public method docs to reflect the same.
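A minimal sketch of the "-1 before any row is read" convention described in the comment above. `ToyRowReader` and its fields are hypothetical stand-ins written for illustration; this is not the parquet-mr `InternalParquetRecordReader`, only a model of the behavior the updated Javadoc describes.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical toy reader (an assumption for illustration, not parquet-mr code).
final class ToyRowReader {
  private final Iterator<String> rows;
  // -1 means "no row has been read yet".
  private long currentRowIdx = -1L;

  ToyRowReader(List<String> rows) {
    this.rows = rows.iterator();
  }

  /**
   * @return the next row, or null when the input is exhausted
   */
  public String read() {
    if (!rows.hasNext()) {
      return null;
    }
    currentRowIdx++;
    return rows.next();
  }

  /**
   * @return the row index of the last read row, or -1 if no row has been read
   */
  public long getCurrentRowIndex() {
    return currentRowIdx;
  }

  public static void main(String[] args) {
    ToyRowReader reader = new ToyRowReader(Arrays.asList("a", "b"));
    System.out.println(reader.getCurrentRowIndex()); // prints -1
    reader.read();
    System.out.println(reader.getCurrentRowIndex()); // prints 0
  }
}
```

Returning a sentinel keeps the getter usable in loops without a try/catch; the cost is that callers have to remember to check for -1 themselves.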
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498260#comment-17498260 ] ASF GitHub Bot commented on PARQUET-2117: - prakharjain09 commented on a change in pull request #945: URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815008136 ## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ## @@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException { return Collections.unmodifiableMap(setMultiMap); } + /** + * Returns the ROW_INDEX of the current row. + */ + public long getCurrentRowIndex() { +if (current == 0L) { + throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row"); +} +if (rowIdxInFileItr == null) { + throw new RowIndexNotSupportedException("underlying page read store implementation" + +" doesn't support row index generation"); +} +return currentRowIdx; + } + + /** + * Resets the row index iterator based on the current processed row group. + */ + private void resetRowIndexIterator(PageReadStore pages) { +Optional rowGroupRowIdxOffset = pages.getRowIndexOffset(); +currentRowIdx = -1L; +if (rowGroupRowIdxOffset.isPresent()) { + final PrimitiveIterator.OfLong rowIdxInRowGroupItr; + if (pages.getRowIndexes().isPresent()) { +rowIdxInRowGroupItr = pages.getRowIndexes().get(); + } else { +// If `pages.getRowIndexes()` is empty, this means column indexing has not triggered. Review comment: removed this code comment.
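The diff above is cut off before it shows how the file-level iterator is built, so the following is only a sketch of one plausible reading: a row's index in the file is the row group's offset plus the row's index inside that group, and when no explicit row indexes are supplied (column index filtering not triggered) every row `0..rowCount-1` of the group is read. `RowIndexInFile`, `rowIndexesInFile`, and the `rowCount` and `selectedRowsInGroup` parameters are hypothetical names introduced for illustration.

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Hypothetical helper (an assumption for illustration, not parquet-mr code).
final class RowIndexInFile {
  /**
   * @param rowGroupOffset index in the file of the row group's first row
   * @param rowCount number of rows in the row group
   * @param selectedRowsInGroup row indexes within the group that survive
   *        filtering, or null when no filtering was applied
   * @return iterator over the file-level row indexes of the rows to be read
   */
  static PrimitiveIterator.OfLong rowIndexesInFile(
      long rowGroupOffset, long rowCount, long[] selectedRowsInGroup) {
    LongStream inGroup = (selectedRowsInGroup != null)
        ? LongStream.of(selectedRowsInGroup)
        : LongStream.range(0, rowCount); // no filtering: all rows of the group
    return inGroup.map(i -> rowGroupOffset + i).iterator();
  }

  public static void main(String[] args) {
    // Rows 1 and 3 of a 4-row group that starts at file row 100.
    PrimitiveIterator.OfLong it = rowIndexesInFile(100L, 4L, new long[] {1L, 3L});
    while (it.hasNext()) {
      System.out.println(it.nextLong()); // prints 101, then 103
    }
  }
}
```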
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498259#comment-17498259 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815007880

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1476,6 +1509,13 @@ public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata) throws
   public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
       InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
+    return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));

Review comment:
   Yes, that's correct. Fixed this: we now pass an empty Map so that we don't populate incorrect rowIndexOffsets.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes API’s to read
> parquet file in columnar fashion or record-by-record.
> It will be great to extend them to also support rowPosition API which can
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from
> such a functionality:
> # Apache Iceberg needs this functionality. It has this implementation
> already as it relies on low level parquet APIs -
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
> # Apache Spark can use this functionality - SPARK-37980

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
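The fix described in the review comment above (passing an empty map rather than offsets computed from possibly filtered metadata) can be illustrated with a small lookup helper. This is only a sketch under assumptions: the class `OffsetLookupSketch`, the method `rowIndexOffset`, and the use of row-group ordinals as map keys are hypothetical and are not taken from parquet-mr. The point is that an empty map makes the offset "unknown" instead of "wrong".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not parquet-mr code.
final class OffsetLookupSketch {
  /**
   * Returns the file-level index of the first row of the given row group,
   * or Optional.empty() when no offset was recorded for it.
   */
  static Optional<Long> rowIndexOffset(Map<Integer, Long> offsetsByOrdinal, int ordinal) {
    return Optional.ofNullable(offsetsByOrdinal.get(ordinal));
  }

  public static void main(String[] args) {
    Map<Integer, Long> known = new HashMap<>();
    known.put(0, 0L);
    known.put(1, 100L);
    // Offsets are available: the caller can compute row positions.
    System.out.println(rowIndexOffset(known, 1).isPresent());
    // Empty map: the caller sees "no offset" and cannot use a wrong one.
    System.out.println(rowIndexOffset(Collections.emptyMap(), 1).isPresent());
  }
}
```

A caller that receives an empty `Optional` can then report that row indexes are unsupported for that read path instead of returning misleading positions.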
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498252#comment-17498252 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814998446

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -227,6 +232,9 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
       try {
         currentValue = recordReader.read();
+        if (rowIdxInFileItr != null) {

Review comment:
   Done.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java ##
@@ -33,6 +33,7 @@
   private long totalByteSize;
   private String path;
   private int ordinal;
+  private long rowIndexOffset;

Review comment:
   Done.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498159#comment-17498159 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814825420

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
   I understand why we are checking 'current == 0L'. I was asking why you chose to throw an exception rather than return an invalid value. This is a public method, so whichever you choose, we should document it.
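The review comment above asks that the behaviour of the public method before any row has been read be documented, whichever option is chosen. A minimal sketch of the "throw and document it" option follows. Everything here is an assumption for illustration: `RowIndexContractSketch`, `advance`, and the use of `IllegalStateException` are hypothetical and stand in for the reader and the custom exception types that appear in the pull request.

```java
// Hypothetical stand-in for a record reader that tracks the current row index.
final class RowIndexContractSketch {
  private final long[] rowIndexes;
  private int consumed = 0; // number of rows read so far

  RowIndexContractSketch(long... rowIndexes) {
    this.rowIndexes = rowIndexes.clone();
  }

  /** Moves to the next row; returns false when no rows are left. */
  boolean advance() {
    if (consumed < rowIndexes.length) {
      consumed++;
      return true;
    }
    return false;
  }

  /**
   * Returns the row index of the most recently read row.
   *
   * @throws IllegalStateException if called before any row has been read
   */
  long getCurrentRowIndex() {
    if (consumed == 0) {
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return rowIndexes[consumed - 1];
  }
}
```

The alternative raised in the review, returning a sentinel such as -1, would avoid the exception but would push the validity check onto every caller; either way the Javadoc should state the contract.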
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498157#comment-17498157 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814818756

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIdxInFileItr == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1L;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        // If `pages.getRowIndexes()` is empty, this means column indexing has not triggered.

Review comment:
   The name 'column index' is already used for the Page Index in another feature. Can you use something else?
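The `resetRowIndexIterator` hunk quoted above combines a per-row-group offset with the positions of rows inside that row group. The idea can be sketched as below. The names `RowIndexIteratorSketch`, `rowIndexesInFile`, and `allRows` are hypothetical, and because the quoted `else` branch is cut off in the review excerpt, the "every row in order" fallback is an assumption rather than the pull request's actual code.

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Hypothetical sketch, not parquet-mr code.
final class RowIndexIteratorSketch {
  /** Shifts row positions within a row group by that row group's offset in the file. */
  static PrimitiveIterator.OfLong rowIndexesInFile(
      long rowGroupOffset, PrimitiveIterator.OfLong indexesInRowGroup) {
    return new PrimitiveIterator.OfLong() {
      @Override
      public boolean hasNext() {
        return indexesInRowGroup.hasNext();
      }

      @Override
      public long nextLong() {
        return rowGroupOffset + indexesInRowGroup.nextLong();
      }
    };
  }

  /** Assumed fallback when no rows were skipped: positions 0 .. rowCount-1. */
  static PrimitiveIterator.OfLong allRows(long rowCount) {
    return LongStream.range(0, rowCount).iterator();
  }
}
```

For example, a row group that starts at file row 150 and contains three unfiltered rows yields 150, 151, 152, while the same offset applied to in-group positions 0 and 2 (one row skipped by filtering) yields 150 and 152.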
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498155#comment-17498155 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814816221

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -227,6 +232,9 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
       try {
         currentValue = recordReader.read();
+        if (rowIdxInFileItr != null) {

Review comment:
   && hasNext()?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498145#comment-17498145 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814797672

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java ##
@@ -33,6 +33,7 @@
   private long totalByteSize;
   private String path;
   private int ordinal;
+  private long rowIndexOffset;

Review comment:
   This field should also be added to the toString() method below.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498141#comment-17498141 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814791362

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1476,6 +1509,13 @@ public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata) throws
   public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
       InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
+    return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));

Review comment:
   As you mentioned above, if parquetMetadata is a filtered one, then generateRowGroupOffsets() won't return accurate offsets, correct?
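The concern in the review comment above can be made concrete with a small sketch. The offset of a row group is the running sum of the row counts of all row groups before it in the file, so the sum is only meaningful when it is taken over the complete, unfiltered list. The class `RowGroupOffsetSketch`, the method `firstRowIndexes`, and the row counts used below are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not parquet-mr code.
final class RowGroupOffsetSketch {
  /** For each row group, the index of its first row: the sum of all earlier row counts. */
  static List<Long> firstRowIndexes(List<Long> rowCounts) {
    List<Long> offsets = new ArrayList<>();
    long sum = 0;
    for (long count : rowCounts) {
      offsets.add(sum);
      sum += count;
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Full footer: three row groups with 100, 50 and 200 rows.
    System.out.println(firstRowIndexes(Arrays.asList(100L, 50L, 200L)));
    // Filtered footer with the first row group dropped: the sums restart at 0,
    // so they no longer match the rows' real positions in the file.
    System.out.println(firstRowIndexes(Arrays.asList(50L, 200L)));
  }
}
```

With the full list the offsets are 0, 100 and 150. With the first row group filtered out, the same computation gives 0 and 50 for row groups whose true first-row positions are 100 and 150, which is the inaccuracy the reviewer points out.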
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498138#comment-17498138 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814786613

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map<RowGroup, Long> rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map<RowGroup, Long> rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
       final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter,
       final int combinedFooterLength) throws IOException {
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
   Got it.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497821#comment-17497821 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1050389884

   @shangxinli I have added a test to cover old parquet files without column indexes. Please review the changes when you get a chance.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496233#comment-17496233 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1048018260

   I will have another look sometime this week.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492997#comment-17492997 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205

   @shangxinli Thanks a lot for the review. I have addressed most of the comments.

   > We need more tests to cover old parquet data that doesn't have a column index.

   I couldn't find any existing tests, or any existing parquet files in the resources directory, that don't have column indexes. Could you please point me to a similar existing test, or to some way of creating a parquet file without column indexes? (I don't see any option to disable writing column indexes either.) I have added tests validating that row indexes are correct with "column index filtering" disabled.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492994#comment-17492994 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205

   @shangxinli Thanks a lot for the review.

   > We need more tests to cover old parquet data that doesn't have a column index.

   I couldn't find any existing tests, or any existing parquet files in the resources directory, that don't have column indexes. Could you please point me to a similar existing test, or to some way of creating a parquet file without column indexes? (I don't see any option to disable writing column indexes either.) I have added tests validating that row indexes are correct with "column index filtering" disabled.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492983#comment-17492983 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }
-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
   This is being used by the new test file - https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129

   We use this to create a reader over the test parquet file so that we can call the new ParquetReader.getRowIndex API for unit testing.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492982#comment-17492982 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }

-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
This is being used by the new test file - https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492981#comment-17492981 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498154

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder builder, List use
     return users;
   }

-  public static List readUsers(ParquetReader.Builder builder) throws IOException {

Review comment:
fixed - not deleting it.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492979#comment-17492979 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807497937

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
`current` is an existing variable that tracks the number of rows already processed. It is initialized to 0 at declaration time. So here we check whether it is still 0, which means we haven't processed any row yet.
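The guard discussed in this comment can be illustrated with a small stand-alone sketch. Everything below is hypothetical: `RowIndexGuardSketch` and `nextRow` are illustrative names, the rows are simulated with a plain counter, and `IllegalStateException` stands in for the PR's own exception type. It only shows why a row index is meaningless while `current` is still 0.

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical stand-in for a record reader; not the parquet-mr class. */
class RowIndexGuardSketch {
  private long current = 0;            // rows processed so far, 0 at declaration
  private long currentRowIndex = -1L;  // index of the most recently processed row
  private final PrimitiveIterator.OfLong rowIndexes;

  RowIndexGuardSketch(long rowCount) {
    // Assumption: rows are numbered 0..rowCount-1 within the file.
    this.rowIndexes = LongStream.range(0, rowCount).iterator();
  }

  /** Simulates processing one row; returns false when no rows remain. */
  boolean nextRow() {
    if (!rowIndexes.hasNext()) {
      return false;
    }
    currentRowIndex = rowIndexes.nextLong();
    current++;
    return true;
  }

  /** Fails if no row has been processed yet, mirroring the check on `current`. */
  long getCurrentRowIndex() {
    if (current == 0L) {
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }
}
```

Calling `getCurrentRowIndex()` before the first `nextRow()` throws; after each successful `nextRow()` it returns the index of the row just processed.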
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492978#comment-17492978 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496854

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
renamed to `rowIdxInFileItr`.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
renamed to shorter name.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIndexWithinFileIterator == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional rowIndexOffsetForCurrentRowGroup = pages.getRowIndexOffset();

Review comment:
renamed to shorter name.
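The quoted diff only shows the first line of `resetRowIndexIterator`, so the following is a guess at the general idea rather than the PR's code: when a row group reports a row-index offset, the reader can build an iterator that yields `offset`, `offset + 1`, ... for that group's rows, and leave the iterator `null` when no offset is available (which is the case the `null` check in `getCurrentRowIndex` handles). The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical helper; a sketch of per-row-group index generation. */
class RowIndexIteratorSketch {
  /**
   * Returns an iterator over the file-level row indexes of one row group,
   * or null when the offset is unknown (assumed convention, matching the
   * null check quoted in the diff above).
   */
  static PrimitiveIterator.OfLong rowIndexesFor(Optional<Long> rowIndexOffset, long rowCount) {
    if (!rowIndexOffset.isPresent()) {
      return null;
    }
    long offset = rowIndexOffset.get();
    // Rows of this group occupy [offset, offset + rowCount) within the file.
    return LongStream.range(offset, offset + rowCount).iterator();
  }
}
```

With an offset of 100 and 3 rows, the iterator yields 100, 101 and 102; with an empty `Optional` the helper returns `null`.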
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492977#comment-17492977 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;

-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
Makes sense - retaining the older constructor.

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }

+  @Override
+  public Optional getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
done.
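One way to retain the older constructor while adding the new parameter is constructor chaining with a sentinel value. The sketch below is an assumption about how that could look, not the merged code: `PageStoreSketch` is a hypothetical class, and the choice of `-1` as "offset unknown" and the empty `Optional` it maps to are illustrative.

```java
import java.util.Optional;

/** Hypothetical stand-in for a page read store; not the parquet-mr class. */
class PageStoreSketch {
  private static final long UNKNOWN_OFFSET = -1L; // assumed sentinel

  private final long rowCount;
  private final long rowIndexOffset;

  /** Older constructor, kept so existing callers still compile. */
  PageStoreSketch(long rowCount) {
    this(rowCount, UNKNOWN_OFFSET);
  }

  /** Newer constructor that also records where this row group starts in the file. */
  PageStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  long getRowCount() {
    return rowCount;
  }

  /** Empty when the store was built through the older constructor. */
  Optional<Long> getRowIndexOffset() {
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```

Callers that still use the one-argument constructor get an empty `Optional`, so downstream code can tell "offset not supplied" apart from "offset is 0".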
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492975#comment-17492975 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496428

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
       final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, final int combinedFooterLength) throws IOException {

     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);

-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
`visit(OffsetMetadataFilter filter)` and `visit(RangeMetadataFilter filter)` return the "filtered" FileMetaData, so some of the row groups might be missing from it. Computing the offsets at the end, on that filtered metadata, would therefore generate an incorrect rowIndexOffset.
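The point about filtering can be checked with a few numbers. The sketch below is a simplified, hypothetical model: row groups are represented only by their row counts, and `offsets` is an illustrative helper that performs the same running sum as `generateRowGroupOffsets` in the diff, keyed by position in the list instead of by `RowGroup` object.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of row-group offsets; row groups are just row counts here. */
class RowGroupOffsetsSketch {
  /** Maps each list position to the index of that row group's first row (running sum). */
  static Map<Integer, Long> offsets(List<Long> rowCounts) {
    Map<Integer, Long> result = new LinkedHashMap<>();
    long rowIdxSum = 0;
    for (int i = 0; i < rowCounts.size(); i++) {
      result.put(i, rowIdxSum);
      rowIdxSum += rowCounts.get(i);
    }
    return result;
  }
}
```

For a file whose row groups hold 10, 20 and 30 rows, the offsets are 0, 10 and 30. If a filter drops the first row group and the sum is taken over the remaining 20 and 30 rows, the result is 0 and 20, although those groups really start at rows 10 and 30 in the file. That is why the offsets have to be computed from the unfiltered metadata.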
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492974#comment-17492974 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807495498

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
done.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491211#comment-17491211 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036746521

We need more tests to cover old parquet data that doesn't have a column index.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491209#comment-17491209 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805065376

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder builder, List use
     return users;
   }

-  public static List readUsers(ParquetReader.Builder builder) throws IOException {

Review comment:
Removing a public method could cause an incompatibility issue.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491210#comment-17491210 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805065458

## File path: parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java ##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }

-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
why make it public?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491207#comment-17491207 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805062241

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter, final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, final int combinedFooterLength) throws IOException {
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
I don't see the need to change each individual implementation. Since generateRowGroupOffsets() only needs fileMetadata, can you just call generateRowGroupOffsets() at the end?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Prakhar Jain
> Priority: Major
> Fix For: 1.13.0
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
> parquet file in columnar fashion or record-by-record.
> It will be great to extend them to also support rowPosition API which can
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet
> table (e.g. Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from
> such a functionality:
> # Apache Iceberg needs this functionality. It has this implementation
> already as it relies on low level parquet APIs -
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
> # Apache Spark can use this functionality - SPARK-37980

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
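Editorial sketch of the offset computation discussed in this review comment: each row group's first-row index is the running sum of the row counts of the row groups before it. The class and method names below (`RowGroupOffsets`, `compute`) are hypothetical and not part of parquet-mr; a plain list of row counts stands in for the Thrift `RowGroup` objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the parquet-mr implementation.
// Input: row counts of all row groups, in file order.
// Output: for each row group ordinal, the file-level index of its first row.
final class RowGroupOffsets {
  private RowGroupOffsets() {}

  static List<Long> compute(List<Long> rowCounts) {
    List<Long> offsets = new ArrayList<>(rowCounts.size());
    long rowIdxSum = 0;
    for (long numRows : rowCounts) {
      offsets.add(rowIdxSum);   // first row of this group
      rowIdxSum += numRows;     // next group starts after this group's rows
    }
    return offsets;
  }
}
```

One point to keep in mind under this scheme: the sum has to run over every row group in file order. If the list were filtered first, the resulting values would no longer be positions within the file.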
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491206#comment-17491206 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805060344

## File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java ##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }

+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
use final
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491198#comment-17491198 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805053725

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIndexWithinFileIterator == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional rowIndexOffsetForCurrentRowGroup = pages.getRowIndexOffset();

Review comment:
The name is so long
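Editorial sketch of the bookkeeping shown in the diff above: an iterator over row indexes is reset at each row group boundary and advanced once per record. `RowIndexTracker` and its methods are hypothetical names. The sketch assumes every row of a row group is read in order, so it does not model rows skipped by column-index filtering, and it uses `IllegalStateException` in place of the PR's dedicated exception types.

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Hypothetical tracker, loosely modeled on the fields in the diff;
// not the parquet-mr implementation.
final class RowIndexTracker {
  private long currentRowIndex = -1L;            // -1 until a row is read
  private PrimitiveIterator.OfLong rowIndexes;   // null until a row group is loaded

  // Called when a new row group starts; its rows occupy
  // [offset, offset + rowCount) within the file.
  void reset(long offset, long rowCount) {
    rowIndexes = LongStream.range(offset, offset + rowCount).iterator();
  }

  // Called once for every record that is read.
  void advance() {
    if (rowIndexes == null || !rowIndexes.hasNext()) {
      throw new IllegalStateException("no row available in the current row group");
    }
    currentRowIndex = rowIndexes.nextLong();
  }

  // Row index of the most recently read record.
  long current() {
    if (currentRowIndex < 0) {
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }
}
```

Throwing instead of returning the `-1` sentinel makes a premature call fail loudly; whether that is preferable to returning `-1` is the design question raised in the review.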
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491196#comment-17491196 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805052859

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }

+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
What is the reason for not returning -1?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491194#comment-17491194 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805052003

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java ##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
The name is so long
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491180#comment-17491180 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805044955

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }

+  @Override
+  public Optional getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
If the constructor caller cannot have a valid rowIndexOffset, I guess we need to provide an option to return empty.
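Editorial sketch of the reviewer's suggestion that the accessor should be able to return empty. It assumes, purely for illustration, that a negative offset is used as the "unknown" marker; the class name `OffsetAwareStore` is hypothetical and the PR may have resolved this differently.

```java
import java.util.Optional;

// Hypothetical store, not the parquet-mr ColumnChunkPageReadStore.
// Assumption: a negative rowIndexOffset means "the caller had no valid offset".
final class OffsetAwareStore {
  private final long rowIndexOffset;

  OffsetAwareStore(long rowIndexOffset) {
    this.rowIndexOffset = rowIndexOffset;
  }

  // Empty when no valid offset was supplied, so callers cannot
  // mistake the sentinel for a real row position.
  Optional<Long> getRowIndexOffset() {
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```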
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491174#comment-17491174 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805042201

## File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java ##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;

-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
Since it is defined as public, it could break if we don't maintain the original signature.
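Editorial sketch of one common way to address the compatibility concern in this comment: keep the original one-argument public constructor and have it delegate to the new two-argument one. `PageStoreSketch` is a hypothetical stand-in, and the `-1` default is an assumption, not a value confirmed by the thread.

```java
// Hypothetical class, not the parquet-mr ColumnChunkPageReadStore.
final class PageStoreSketch {
  private final long rowCount;
  private final long rowIndexOffset;

  // Original public signature, kept so existing callers still compile and link.
  // Assumption: -1 marks "row index offset not provided".
  public PageStoreSketch(long rowCount) {
    this(rowCount, -1L);
  }

  // New signature carrying the row index offset of the row group.
  public PageStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  long getRowCount() {
    return rowCount;
  }

  long getRowIndexOffset() {
    return rowIndexOffset;
  }
}
```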
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491072#comment-17491072 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036460871

> Can you squash the commits to make the review easier?

done
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491066#comment-17491066 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036445953

Can you squash the commits to make the review easier?
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490650#comment-17490650 ]

Prakhar Jain commented on PARQUET-2117:
---------------------------------------

[~sha...@uber.com] [~gszadovszky] Could you please [review the PR|https://github.com/apache/parquet-mr/pull/945] and provide your feedback. Thanks!
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488274#comment-17488274 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1031713582

@shangxinli @gszadovszky Please review the changes when you get a chance. Thanks!
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488273#comment-17488273 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r800879793

## File path: parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java ##
@@ -43,6 +43,14 @@
    */
   long getRowCount();

+  /**
+   *
+   * @return the row index offset of this row group.
+   */
+  default Optional getRowIndexOffset() {

Review comment:
@shangxinli I added this new method to the interface. This leads to a build failure from japicmp-maven-plugin: METHOD_ADDED_TO_INTERFACE. I provided a default implementation for this method, and this seems to solve the issue. Could you please give feedback on this approach or suggest an alternative?
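Editorial sketch of the default-method approach described in this comment. Adding an abstract method to a published interface breaks every existing implementer, whereas a `default` method gives them an inherited body. The interface and class names below (`RowStore`, `LegacyStore`, `IndexedStore`) are hypothetical stand-ins for `PageReadStore` and its implementations, and returning `Optional.empty()` from the default is an assumption of this sketch.

```java
import java.util.Optional;

// Hypothetical stand-in for the PageReadStore interface.
interface RowStore {
  long getRowCount();

  // New method with a default body: implementations written before this
  // method existed keep compiling and report "no offset available".
  default Optional<Long> getRowIndexOffset() {
    return Optional.empty();
  }
}

// An implementation written before getRowIndexOffset() was added; unchanged.
final class LegacyStore implements RowStore {
  @Override
  public long getRowCount() {
    return 10L;
  }
}

// An implementation that knows its offset overrides the default.
final class IndexedStore implements RowStore {
  private final long rowCount;
  private final long rowIndexOffset;

  IndexedStore(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  @Override
  public long getRowCount() {
    return rowCount;
  }

  @Override
  public Optional<Long> getRowIndexOffset() {
    return Optional.of(rowIndexOffset);
  }
}
```

A caller can then treat an empty result as "row index not supported by this store", which matches the `RowIndexNotSupportedException` path seen elsewhere in the PR's diff.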
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487381#comment-17487381 ]

ASF GitHub Bot commented on PARQUET-2117:
-----------------------------------------

prakharjain09 opened a new pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945

Make sure you have checked _all_ steps below.

### Jira

- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-2117
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests

- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

Extended all the ColumnIndexFiltering and BloomFiltering tests to also validate the "row index". This adds unit test coverage for the following scenarios for this feature: Parquet V1/V2 with encryption on/off with no-filter/simple-filter/column-index-filter/bloom-filter

### Commits

- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation

- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486806#comment-17486806 ]

Prakhar Jain commented on PARQUET-2117:
---------------------------------------

[~sha...@uber.com] Yes I am working on this. Will share the PR soon.
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers
[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485949#comment-17485949 ]

Xinli Shang commented on PARQUET-2117:
--------------------------------------

Thanks for opening this Jira! Look forward to the PR.