
[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-13 Thread Gidon Gershinsky (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553425#comment-17553425
 ] 

Gidon Gershinsky commented on PARQUET-2117:
---

[~sha...@uber.com] Could you add [~prakharjain09] to the Parquet contributors?

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>         Key: PARQUET-2117
>         URL: https://issues.apache.org/jira/browse/PARQUET-2117
>     Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>    Reporter: Prakhar Jain
>    Priority: Major
>     Fix For: 1.12.3
>
>
> Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read 
> a Parquet file in columnar fashion or record by record.
> It would be useful to extend them with a rowPosition API that reports 
> the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier within a file. This 
> can be useful for building an index (e.g. a B+ tree) over a Parquet file or 
> Parquet table (e.g. in Spark/Hive).
> Several projects in the Parquet ecosystem can benefit from this 
> functionality:
>  # Apache Iceberg needs this functionality. It already has its own 
> implementation because it relies on low-level Parquet APIs: 
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality: SPARK-37980
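
As a rough illustration of the idea in the description, the sketch below derives file-level row positions from per-row-group row counts and uses positions as row identifiers in a small sorted index. It is only a sketch under stated assumptions: `RowPositionIndexSketch`, `rowGroupOffsets`, and `buildIndex` are hypothetical names, a `TreeMap` stands in for a B+ tree, and nothing here calls parquet-mr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, self-contained sketch; it does not call any parquet-mr API.
final class RowPositionIndexSketch {

  // Returns the file-level position of the first row of each row group,
  // given the number of rows in each row group in file order.
  static long[] rowGroupOffsets(long[] rowCounts) {
    long[] offsets = new long[rowCounts.length];
    long running = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      offsets[i] = running;
      running += rowCounts[i];
    }
    return offsets;
  }

  // Builds a sorted index (a TreeMap standing in for a B+ tree) from a key
  // value to the positions of the rows that hold that key.
  static TreeMap<String, List<Long>> buildIndex(List<String> keysInFileOrder) {
    TreeMap<String, List<Long>> index = new TreeMap<>();
    long position = 0;
    for (String key : keysInFileOrder) {
      index.computeIfAbsent(key, k -> new ArrayList<>()).add(position);
      position++;
    }
    return index;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(rowGroupOffsets(new long[] {3, 2, 4})));
    System.out.println(buildIndex(Arrays.asList("b", "a", "b")));
  }
}
```

A position is only unique within one file, so an index over a table would presumably need to pair it with a file identifier.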



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-06-08 Thread Prakhar Jain (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551740#comment-17551740
 ] 

Prakhar Jain commented on PARQUET-2117:
---

Resolving this issue as the PR is merged. [~gershinsky] Could you reassign the 
Jira to me?


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-04-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527900#comment-17527900
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

ggershinsky commented on PR #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1109371055

   @prakharjain09 hopefully, we'll resolve the remaining issues at the 
community sync tomorrow, and start working on a cut.




> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0

[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-04-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527760#comment-17527760
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on PR #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1109035222

   @ggershinsky Is there a tentative date or rough estimate for when we plan 
to cut the RC for the next release?





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-28 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513174#comment-17513174
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1080253613


   @prakharjain09 the upcoming parquet release will include the current master 
(plus a couple of WIP PRs, once they are merged), so this patch will be covered.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-27 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513123#comment-17513123
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1080148066


   @shangxinli @ggershinsky Thanks a lot for reviewing this change.
   
   This will unblock SPARK-37980 if it is released as part of the upcoming 
Parquet release. Do we need to cherry-pick it to a release branch for that?



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-19 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509355#comment-17509355
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli merged pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945


   



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-19 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509176#comment-17509176
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1072953219


   Thanks @ggershinsky for the review. I have addressed the comments and fixed 
the build issue.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-19 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509174#comment-17509174
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r830445563



##
File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +275,46 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the row index of the current row. If no row has been processed or if the
+   * row index information is unavailable from the underlying @{@link PageReadStore}, returns -1.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L || rowIdxInFileItr == null) {
+      return -1;
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        rowIdxInRowGroupItr = LongStream.range(0, pages.getRowCount()).iterator();
+      }
+      // Adjust the row group offset in the `rowIndexWithinRowGroupIterator` iterator.
+      this.rowIdxInFileItr = new PrimitiveIterator.OfLong() {
+        public long nextLong() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.nextLong();
+        }
+
+        public boolean hasNext() {
+          return rowIdxInRowGroupItr.hasNext();
+        }
+
+        public Long next() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.next();
+        }
+      };
+    } else {

Review comment:
   done.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506940#comment-17506940
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1068061762


   @prakharjain09 After you fix the CI failures, we can merge. 



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17506762#comment-17506762
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1067671513


   thanks for this change. The PR looks good to me now, I'll add my approval 
after it passes the CI tests.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-13 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17505405#comment-17505405
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

ggershinsky commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r825408797



##
File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +275,46 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the row index of the current row. If no row has been processed or if the
+   * row index information is unavailable from the underlying @{@link PageReadStore}, returns -1.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L || rowIdxInFileItr == null) {
+      return -1;
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        rowIdxInRowGroupItr = LongStream.range(0, pages.getRowCount()).iterator();
+      }
+      // Adjust the row group offset in the `rowIndexWithinRowGroupIterator` iterator.
+      this.rowIdxInFileItr = new PrimitiveIterator.OfLong() {
+        public long nextLong() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.nextLong();
+        }
+
+        public boolean hasNext() {
+          return rowIdxInRowGroupItr.hasNext();
+        }
+
+        public Long next() {
+          return rowGroupRowIdxOffset.get() + rowIdxInRowGroupItr.next();
+        }
+      };
+    } else {

Review comment:
   nit: could you start the method with checking this condition 
(!rowGroupRowIdxOffset.isPresent()), and then return? Will look cleaner.
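
The early-return shape suggested in this review comment could look roughly like the sketch below. It is a self-contained approximation, not the merged code: `PageStoreSketch` is a hypothetical stand-in for the three `PageReadStore` methods the hunk calls, `store` and `fileRowIndexes` are helper names added for illustration, and setting the iterator to `null` when no offset is available is an assumption, since the quoted hunk is cut off at the `else` branch.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Hypothetical stand-in for the PageReadStore methods used in the hunk above.
interface PageStoreSketch {
  long getRowCount();
  Optional<Long> getRowIndexOffset();
  Optional<PrimitiveIterator.OfLong> getRowIndexes();
}

final class RowIndexResetSketch {
  private PrimitiveIterator.OfLong rowIdxInFileItr;

  // Early-return variant: handle the "no offset available" case first.
  void resetRowIndexIterator(PageStoreSketch pages) {
    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
    if (!rowGroupRowIdxOffset.isPresent()) {
      this.rowIdxInFileItr = null; // assumption: the elided else branch does this
      return;
    }
    final long offset = rowGroupRowIdxOffset.get();
    // Use the store's own row indexes if present, else 0..rowCount-1.
    final PrimitiveIterator.OfLong rowIdxInRowGroupItr = pages.getRowIndexes()
        .orElseGet(() -> LongStream.range(0, pages.getRowCount()).iterator());
    // Shift each row-group-local index by the row group's offset in the file.
    this.rowIdxInFileItr = new PrimitiveIterator.OfLong() {
      @Override public long nextLong() { return offset + rowIdxInRowGroupItr.nextLong(); }
      @Override public boolean hasNext() { return rowIdxInRowGroupItr.hasNext(); }
    };
  }

  PrimitiveIterator.OfLong fileRowIndexes() {
    return rowIdxInFileItr;
  }

  // Illustration-only factory for a store without explicit row indexes.
  static PageStoreSketch store(long rowCount, Optional<Long> offset) {
    return new PageStoreSketch() {
      public long getRowCount() { return rowCount; }
      public Optional<Long> getRowIndexOffset() { return offset; }
      public Optional<PrimitiveIterator.OfLong> getRowIndexes() { return Optional.empty(); }
    };
  }

  public static void main(String[] args) {
    RowIndexResetSketch reader = new RowIndexResetSketch();
    reader.resetRowIndexIterator(store(3, Optional.of(10L)));
    reader.fileRowIndexes().forEachRemaining((long idx) -> System.out.println(idx));
  }
}
```

Checking the absent case first removes one nesting level and leaves the offset arithmetic as the main path of the method.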





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-08 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502820#comment-17502820
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1061584498


   @shangxinli Thanks for taking another look. I have addressed all comments 
other [than one](https://github.com/apache/parquet-mr/pull/945#discussion_r820928524). 
Please advise on that one. Thanks!



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502719#comment-17502719
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

ggershinsky commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1061409993


   Hi guys, I'm OOO (vacation) this week. I can review it next week if that 
helps, but feel free to go ahead without waiting for me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes API’s to read 
> parquet file in columnar fashion or record-by-record.
> It will be great to extend them to also support rowPosition API which can 
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet 
> table (e.g.  Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from 
> such a functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
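
To make the indexing use case in the description concrete, here is a minimal sketch of how a row position could serve as the value side of a secondary index. It assumes nothing about the parquet-mr API: `RowPositionIndex` and its methods are hypothetical names invented for illustration, and a `TreeMap` stands in for the B+ tree mentioned above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical secondary index: column value -> row positions within one file. */
final class RowPositionIndex {
  // A sorted map stands in for the B+ tree mentioned in the issue description.
  private final TreeMap<String, List<Long>> positionsByKey = new TreeMap<>();

  /** Records that the row at rowPosition holds the given key. */
  void add(String key, long rowPosition) {
    positionsByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(rowPosition);
  }

  /** Returns the row positions recorded for key, or an empty list if there are none. */
  List<Long> lookup(String key) {
    return positionsByKey.getOrDefault(key, Collections.emptyList());
  }

  public static void main(String[] args) {
    // Pretend these values were read record by record; the array index plays
    // the role of the row position a reader would report.
    String[] names = {"alice", "bob", "alice"};
    RowPositionIndex index = new RowPositionIndex();
    for (int i = 0; i < names.length; i++) {
      index.add(names[i], i);
    }
    System.out.println(index.lookup("alice"));
  }
}
```

In a real reader the position would come from the row position API proposed here rather than from an array index.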



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502447#comment-17502447
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820930501



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
##
@@ -140,6 +140,16 @@ public T read() throws IOException {
 }
   }
 
+  /**
+   * Returns the row index of the last read row. If no row has been processed, returns -1.

Review comment:
   fixed.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502445#comment-17502445
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820928524



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReader.java
##
@@ -46,10 +47,19 @@
 
   private static final Path FILE_V1 = createTempFile();
   private static final Path FILE_V2 = createTempFile();
-  private static final List DATA = Collections.unmodifiableList(makeUsers(1));
+  private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet");

Review comment:
   @shangxinli It looks like 
[column indexes](https://issues.apache.org/jira/browse/PARQUET-1201) are always 
written by the current version of Parquet, and this is not configurable.
   We already test the new row index support both with and without column 
index filtering being triggered (as part of TestColumnIndexFiltering). Also, the 
new row index feature doesn't rely on column indexes in any way. So we can skip 
the backward-compatibility testing and remove this Parquet file from resources. 
What do you think?





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502428#comment-17502428
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060907004


   I just left some comments. Other than that, it looks good to me. Adding 
@ggershinsky in case he has time to take a look. 
   
   Beyond this PR, if the work you are doing in Iceberg/Spark can be done in 
Parquet, please consider adding it to parquet-mr. That way, it can benefit 
all the applications that use parquet-mr. 



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502423#comment-17502423
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060907004


   I just left some comments. Other than that, it looks good to me. Adding 
@ggershinsky in case he has time to take a look. 



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502422#comment-17502422
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820904662



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReader.java
##
@@ -46,10 +47,19 @@
 
   private static final Path FILE_V1 = createTempFile();
   private static final Path FILE_V2 = createTempFile();
-  private static final List DATA = Collections.unmodifiableList(makeUsers(1));
+  private static final Path STATIC_FILE_WITHOUT_COL_INDEXES = createPathFromCP("/test-file-with-no-column-indexes-1.parquet");

Review comment:
   I am not sure it is a good idea to check in a data file. Can you 
check whether it is possible to turn off offset index generation in the current version 
of Parquet? 
   





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502413#comment-17502413
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820893950



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
##
@@ -140,6 +140,16 @@ public T read() throws IOException {
 }
   }
 
+  /**
+   * Returns the row index of the last read row. If no row has been processed, returns -1.

Review comment:
   Given this is a public method, we need to take care of the Javadoc 
decorations. Please refer to the other methods in this class and follow the same style. 





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502411#comment-17502411
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r820893950



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java
##
@@ -140,6 +140,16 @@ public T read() throws IOException {
 }
   }
 
+  /**
+   * Returns the row index of the last read row. If no row has been processed, returns -1.

Review comment:
   Given this is a public method, we need to take care of the Javadoc 
decorations. 





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502159#comment-17502159
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1060365981


   > Could you please look into the PR again?
   > Also, could you share when we are planning to do the 
code freeze for the next minor release? It would be great if we could release this 
change in the next minor/patch release so that Apache Spark and other projects get to 
use this functionality sooner.
   
   
   @shangxinli Gentle reminder. Please take a look when you get a chance.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-01 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499689#comment-17499689
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1055744409


   @shangxinli Thanks a lot for the review. I have addressed the review 
comments. Could you please look into the PR again?
   
   Also, could you share when we are planning to do the 
code freeze for the next minor release? It would be great if we could release this 
change in the next minor/patch release so that Apache Spark and other projects get to 
use this functionality sooner.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-03-01 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17499687#comment-17499687
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review. I have addressed most of the 
comments.
   
   > We need more tests to cover old Parquet 
data that doesn't have a column index.
   
   I couldn't find any existing tests or any existing Parquet files in the resources 
directory that don't have column indexes. Could you please point me 
to a similar existing test or to a way to create a Parquet file without column 
indexes (I don't see any option to disable writing column indexes either)?
   I have added tests validating that row indexes are correct with "column index 
filtering" disabled.
   



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498261#comment-17498261
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815008882



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
 return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+if (current == 0L) {

Review comment:
   There was no specific reason for choosing exception over -1.
   I have updated it to return -1 and also updated all the public method docs 
to reflect the same.
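
A minimal sketch of the contract described in this comment, where -1 is returned instead of throwing when no row has been processed yet. `SentinelRowIndexReader` is a hypothetical in-memory stand-in written only to illustrate the behavior; it is not the parquet-mr reader, and the field and method names are assumptions.

```java
/** Hypothetical reader over an in-memory array, used only to illustrate the -1 contract. */
final class SentinelRowIndexReader {
  private final String[] rows;
  // -1 means "no row has been processed yet".
  private long currentRowIndex = -1L;

  SentinelRowIndexReader(String... rows) {
    this.rows = rows;
  }

  /** Returns the next row, or null once the input is exhausted. */
  String read() {
    if (currentRowIndex + 1 >= rows.length) {
      return null; // end of input; the index keeps pointing at the last read row
    }
    currentRowIndex++;
    return rows[(int) currentRowIndex];
  }

  /** Returns the index of the last read row, or -1 if no row has been read. */
  long getCurrentRowIndex() {
    return currentRowIndex;
  }

  public static void main(String[] args) {
    SentinelRowIndexReader reader = new SentinelRowIndexReader("a", "b");
    System.out.println(reader.getCurrentRowIndex());
    while (reader.read() != null) {
      System.out.println(reader.getCurrentRowIndex());
    }
  }
}
```

With a sentinel, callers can poll the index without a try/catch, at the cost of having to check for -1 themselves.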





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498260#comment-17498260
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815008136



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
 return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+if (current == 0L) {
+  throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+}
+if (rowIdxInFileItr == null) {
+  throw new RowIndexNotSupportedException("underlying page read store implementation" +
+" doesn't support row index generation");
+}
+return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+currentRowIdx = -1L;
+if (rowGroupRowIdxOffset.isPresent()) {
+  final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+  if (pages.getRowIndexes().isPresent()) {
+rowIdxInRowGroupItr = pages.getRowIndexes().get();
+  } else {
+// If `pages.getRowIndexes()` is empty, this means column indexing has not triggered.

Review comment:
   removed this code comment.
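
The quoted hunk is cut off at the `else` branch, so the following is only a sketch of the idea it appears to implement: take the row indexes within the row group (the selected ones if filtering produced them, otherwise every row) and shift them by the row group's offset in the file. `RowIndexIterators`, `rowIndexesInFile`, and its parameters are hypothetical names, and the fallback to `0..rowCount-1` is an assumption based on the code comment above, not the PR's confirmed code.

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

/** Hypothetical helper that maps row indexes within a row group to row indexes within the file. */
final class RowIndexIterators {
  private RowIndexIterators() {}

  /**
   * rowGroupOffset: index in the file of the row group's first row.
   * rowCount: number of rows in the row group.
   * selectedRows: indexes within the row group that survived filtering, if any filtering happened.
   */
  static PrimitiveIterator.OfLong rowIndexesInFile(
      long rowGroupOffset, long rowCount, Optional<PrimitiveIterator.OfLong> selectedRows) {
    // Assumption: with no filtering, every row of the row group is read in order.
    PrimitiveIterator.OfLong inRowGroup =
        selectedRows.orElseGet(() -> LongStream.range(0, rowCount).iterator());
    return new PrimitiveIterator.OfLong() {
      @Override
      public boolean hasNext() {
        return inRowGroup.hasNext();
      }

      @Override
      public long nextLong() {
        // Shift the position within the row group by the row group's offset in the file.
        return rowGroupOffset + inRowGroup.nextLong();
      }
    };
  }

  public static void main(String[] args) {
    PrimitiveIterator.OfLong it = rowIndexesInFile(100L, 3L, Optional.empty());
    while (it.hasNext()) {
      System.out.println(it.nextLong());
    }
  }
}
```

A reader would advance such an iterator once per record and remember the last value as the current row index.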





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498259#comment-17498259
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r815007880



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1476,6 +1509,13 @@ public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata) throws
 
   public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
       InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
+    return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));

Review comment:
   Yes, that's correct. Fixed this: we now pass an empty Map so that we 
don't populate incorrect rowIndexOffsets.
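
   To illustrate the reply above: when the offsets map is empty, a lookup for
any row group finds nothing, so a caller can report "no row index available"
instead of returning a wrong position. A minimal sketch, assuming a
hypothetical `OffsetLookup` helper that keys offsets by row group ordinal;
neither the class nor the method name comes from parquet-mr:

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, for illustration only.
final class OffsetLookup {
  /**
   * Returns the starting row index of the given row group, or an empty
   * Optional when no offset was recorded for it (for example because the
   * caller passed an empty map on purpose).
   */
  static Optional<Long> rowIndexOffset(Map<Integer, Long> offsetsByOrdinal, int ordinal) {
    return Optional.ofNullable(offsetsByOrdinal.get(ordinal));
  }
}
```

   With an empty map every lookup returns `Optional.empty()`, which lets the
reader side treat the offset as unknown rather than as zero.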





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498252#comment-17498252
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814998446



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -227,6 +232,9 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
 
         try {
           currentValue = recordReader.read();
+          if (rowIdxInFileItr != null) {

Review comment:
   done.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java
##
@@ -33,6 +33,7 @@
   private long totalByteSize;
   private String path;
   private int ordinal;
+  private long rowIndexOffset;

Review comment:
   done.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498159#comment-17498159
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814825420



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
   I understand why we are checking 'current == 0L'. I was asking why you 
chose to throw an exception rather than return an invalid value. This is a 
public method, so the behavior should be documented whichever way you choose.
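
   One way to document the "throw" option is a Javadoc `@throws` tag on the
public method. A minimal sketch, assuming a hypothetical `RowIndexTracker`
class and using the standard `IllegalStateException` as a stand-in for the
PR's custom exception type:

```java
// Hypothetical class, for illustration only; not part of parquet-mr.
final class RowIndexTracker {
  private long rowsRead = 0L;        // plays the role of `current` in the hunk above
  private long currentRowIdx = -1L;  // index of the most recently read row

  /** Records that one more row was read and remembers its index. */
  void onRowRead(long rowIdx) {
    rowsRead++;
    currentRowIdx = rowIdx;
  }

  /**
   * Returns the index of the most recently read row.
   *
   * @throws IllegalStateException if no row has been read yet
   */
  long getCurrentRowIndex() {
    if (rowsRead == 0L) {
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIdx;
  }
}
```

   The alternative is to return a sentinel such as -1 and say so in the
Javadoc. Throwing makes misuse visible at once, while a sentinel can be
mistaken for a real value if the caller forgets to check it.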





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498157#comment-17498157
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814818756



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIdxInFileItr == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIdx;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowGroupRowIdxOffset = pages.getRowIndexOffset();
+    currentRowIdx = -1L;
+    if (rowGroupRowIdxOffset.isPresent()) {
+      final PrimitiveIterator.OfLong rowIdxInRowGroupItr;
+      if (pages.getRowIndexes().isPresent()) {
+        rowIdxInRowGroupItr = pages.getRowIndexes().get();
+      } else {
+        // If `pages.getRowIndexes()` is empty, this means column indexing has not triggered.

Review comment:
   The name 'column index' is already used for the Page Index feature. Can 
you use a different term here?
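
   For readers following the hunk: the iterator being reset yields row
indexes relative to the row group, and these have to be shifted by the row
group's starting offset to become positions in the file. A minimal sketch of
that shift, assuming a hypothetical `RowIndexSketch` helper. The `allRows`
fallback is a guess at what the truncated `else` branch does, not something
the quoted hunk shows:

```java
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Hypothetical helper, for illustration only; not part of parquet-mr.
final class RowIndexSketch {
  /** Shifts row-group-relative row indexes by the row group's starting offset in the file. */
  static PrimitiveIterator.OfLong toFileRowIndexes(long rowGroupOffset, PrimitiveIterator.OfLong inGroup) {
    return new PrimitiveIterator.OfLong() {
      @Override
      public boolean hasNext() {
        return inGroup.hasNext();
      }

      @Override
      public long nextLong() {
        return rowGroupOffset + inGroup.nextLong();
      }
    };
  }

  /** Assumed fallback when no rows were filtered out: rows 0..rowCount-1 of the group. */
  static PrimitiveIterator.OfLong allRows(long rowCount) {
    return LongStream.range(0, rowCount).iterator();
  }
}
```

   For a row group that starts at file row 100 and holds 3 rows,
`toFileRowIndexes(100L, allRows(3))` yields 100, 101 and 102.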





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498155#comment-17498155
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814816221



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -227,6 +232,9 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
 
         try {
           currentValue = recordReader.read();
+          if (rowIdxInFileItr != null) {

Review comment:
   Should this condition also check `&& rowIdxInFileItr.hasNext()`?
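
   The guard suggested here can be sketched as follows, assuming a
hypothetical `GuardedAdvance` helper. Without the `hasNext()` check, calling
`nextLong()` on an exhausted iterator would typically throw
`NoSuchElementException`:

```java
import java.util.PrimitiveIterator;

// Hypothetical helper, for illustration only; not part of parquet-mr.
final class GuardedAdvance {
  /**
   * Returns the next row index from the iterator, or the previous value
   * unchanged when the iterator is absent or already exhausted.
   */
  static long advance(PrimitiveIterator.OfLong rowIdxItr, long previous) {
    if (rowIdxItr != null && rowIdxItr.hasNext()) {
      return rowIdxItr.nextLong();
    }
    return previous;
  }
}
```

   Whether keeping the previous value is the right fallback depends on the
reader. The point is only that both conditions are checked before advancing.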





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498145#comment-17498145
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814797672



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java
##
@@ -33,6 +33,7 @@
   private long totalByteSize;
   private String path;
   private int ordinal;
+  private long rowIndexOffset;

Review comment:
   It should also be added to the toString() method below.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498141#comment-17498141
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814791362



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1476,6 +1509,13 @@ public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata) throws
 
   public ParquetMetadata fromParquetMetadata(FileMetaData parquetMetadata,
       InternalFileDecryptor fileDecryptor, boolean encryptedFooter) throws IOException {
+    return fromParquetMetadata(parquetMetadata, fileDecryptor, encryptedFooter, generateRowGroupOffsets(parquetMetadata));

Review comment:
   As you mentioned above, if parquetMetadata is a filtered one, then 
generateRowGroupOffsets() won't return accurate offsets, correct?
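
   A small worked example of the concern: starting offsets are prefix sums of
the row counts, so they are only correct when computed over the complete list
of row groups. A minimal sketch, assuming a hypothetical `RowGroupOffsets`
helper that takes plain row counts instead of Parquet metadata objects:

```java
// Hypothetical helper, for illustration only; not part of parquet-mr.
final class RowGroupOffsets {
  /** Returns the starting row index of each row group, given row counts in file order. */
  static long[] startingOffsets(long[] rowCounts) {
    long[] offsets = new long[rowCounts.length];
    long sum = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      offsets[i] = sum;      // rows before this row group
      sum += rowCounts[i];
    }
    return offsets;
  }
}
```

   For a file whose row groups hold 10, 20 and 30 rows, the offsets are 0, 10
and 30. If a filter has already dropped the first row group and only {20, 30}
is passed in, the result is 0 and 20, whereas the true file positions of those
two row groups are 10 and 30.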





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-25 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17498138#comment-17498138
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r814786613



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map<RowGroup, Long> rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map<RowGroup, Long> rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
       final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter,
       final int combinedFooterLength) throws IOException {
 
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
 
-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor<FileMetaData, IOException>() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor<FileMetaDataAndRowGroupOffsetInfo, IOException>() {

Review comment:
   Got it.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-24 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497821#comment-17497821
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1050389884


   @shangxinli I have added a test to cover an old Parquet file without column 
indexes. Please review the changes when you get a chance.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-22 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17496233#comment-17496233
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1048018260


   I will have another look sometime this week.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492997#comment-17492997
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 edited a comment on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review. I have addressed most of the 
comments.
   
   > We need more test to cover old parquet data that doesn't have column index.
   
   I couldn't find any existing tests, or any existing Parquet files in the 
Resource directory, that lack column indexes. Could you please point me to a 
similar existing test, or to a way of creating a Parquet file without column 
indexes? (I don't see any option to disable writing column indexes either.)
   I have added tests validating that row indexes are correct with "column 
index filtering" disabled.
   



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492994#comment-17492994
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1041089205


   @shangxinli Thanks a lot for the review.
   
   > We need more tests to cover old parquet data that doesn't have a column 
index.
   
   I couldn't find any existing tests, or any parquet files in the Resource 
directory, that lack column indexes. Could you please point me to a similar 
existing test, or to a way of creating a parquet file without column indexes 
(I don't see an option to disable writing column indexes either)?
   I have added tests validating that row indexes are correct with "column index 
filtering" disabled.
   



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492983#comment-17492983
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }
 
-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
   This is being used by the new test file - 
https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129
   We use this to create a reader over the test parquet file so that we can 
call the new ParquetReader.getRowIndex API for unit testing.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492982#comment-17492982
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498522



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }
 
-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
   This is being used by the new test file - 
https://github.com/apache/parquet-mr/pull/945/files#diff-276ac02899424a4245b8589f8ee6d444e3619f6be17834e4d5d7e81dfdaaee39R129





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492981#comment-17492981
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807498154



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder builder, List use
     return users;
   }
 
-  public static List readUsers(ParquetReader.Builder builder) throws IOException {

Review comment:
   fixed - not deleting it.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492979#comment-17492979
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807497937



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
   `current` is an existing variable that tracks the number of rows already 
processed. It is initialized to 0 at declaration, so if it is still 0 here, 
no row has been processed yet.
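
The guard described above can be sketched in isolation. This is a minimal illustration, not the code from the pull request: the class `RowIndexGuard` and the method `recordRow` are hypothetical names, and `IllegalStateException` stands in for the dedicated exception type that the diff introduces.

```java
// Sketch only: RowIndexGuard and recordRow are hypothetical stand-ins,
// not the classes touched by the pull request.
final class RowIndexGuard {
  private long current = 0;           // rows processed so far, starts at 0
  private long currentRowIndex = -1L; // file-level index of the last processed row

  /** Records that the row at the given file-level index has been processed. */
  void recordRow(long rowIndex) {
    currentRowIndex = rowIndex;
    current++;
  }

  /** Returns the index of the current row, or fails if no row was processed yet. */
  long getCurrentRowIndex() {
    if (current == 0L) {
      // The diff throws a dedicated exception here; IllegalStateException is an
      // assumption made to keep this sketch free of extra classes.
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }
}
```

Checking the processed-row counter rather than the index itself keeps the "nothing read yet" case separate from whatever value the index field happens to hold.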





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492978#comment-17492978
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496854



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
   renamed to `rowIdxInFileItr`.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
   renamed to shorter name.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIndexWithinFileIterator == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional rowIndexOffsetForCurrentRowGroup = pages.getRowIndexOffset();

Review comment:
   renamed to shorter name.
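
For readers following the hunk above, here is one way the per-row-group reset could be sketched. It is an assumption-laden illustration rather than the pull request's implementation: `RowIndexIterators` and `forRowGroup` are hypothetical names, and the sketch assumes every row of the row group is read (no rows skipped by filtering).

```java
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.stream.LongStream;

// Sketch only: RowIndexIterators is a hypothetical helper, not code from the pull request.
final class RowIndexIterators {
  /**
   * Builds an iterator over file-level row indexes for one row group.
   * Returns null when no offset is available, which matches the null check
   * on the iterator in getCurrentRowIndex() in the hunk above.
   */
  static PrimitiveIterator.OfLong forRowGroup(Optional<Long> rowIndexOffset, long rowCount) {
    if (!rowIndexOffset.isPresent()) {
      return null;
    }
    long first = rowIndexOffset.get();
    // Assumes all rows of the row group are read in order.
    return LongStream.range(first, first + rowCount).iterator();
  }
}
```

With an offset of 100 and a row count of 3, the iterator yields 100, 101 and 102, i.e. positions within the file rather than within the row group.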





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492977#comment-17492977
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496556



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
 
   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;
 
-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
   Makes sense - retaining the older constructor.

##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }
 
+  @Override
+  public Optional getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
   done.
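
Retaining the older constructor while adding the offset could look roughly like the sketch below. The class `PageStoreSketch` is hypothetical, and the use of `-1` as an "offset unknown" sentinel is an assumption made for illustration; the review thread does not state how the missing offset is represented.

```java
import java.util.Optional;

// Sketch only: PageStoreSketch is a hypothetical class; the -1 sentinel is an assumption.
final class PageStoreSketch {
  private final long rowCount;
  private final long rowIndexOffset;

  /** Older constructor kept for existing callers; no offset is known. */
  PageStoreSketch(long rowCount) {
    this(rowCount, -1L);
  }

  /** Newer constructor that also receives the index of the row group's first row. */
  PageStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  long getRowCount() {
    return rowCount;
  }

  /** Empty when the store was built without an offset. */
  Optional<Long> getRowIndexOffset() {
    return rowIndexOffset < 0 ? Optional.empty() : Optional.of(rowIndexOffset);
  }
}
```

Callers of the one-argument constructor keep compiling, and readers that need row indexes can detect the missing offset through the empty `Optional`.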





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492975#comment-17492975
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807496428



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
       final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter,
       final int combinedFooterLength) throws IOException {
 
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
 
-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
   `visit(OffsetMetadataFilter filter)` and `visit(RangeMetadataFilter filter)` 
return the "filtered" FileMetaData, so some of the row groups may be missing 
from it. Computing the offsets at the end, on the filtered metadata, would 
therefore generate incorrect rowIndexOffset values.
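
A small worked example may make the ordering problem concrete. The sketch below reduces each row group to its row count and the metadata filter to a list of kept ordinals; `RowGroupOffsetOrder` and its methods are hypothetical names and none of this is parquet-mr code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: row groups are reduced to their row counts, and the filter to a
// list of kept ordinals. None of these names come from parquet-mr.
final class RowGroupOffsetOrder {
  /** First-row index of every row group, computed over the given list in order. */
  static List<Long> firstRowIndexes(List<Long> rowCounts) {
    List<Long> offsets = new ArrayList<>();
    long sum = 0;
    for (long count : rowCounts) {
      offsets.add(sum);
      sum += count;
    }
    return offsets;
  }

  /** Correct order: compute offsets on the full list, then keep the filtered ordinals. */
  static List<Long> offsetsThenFilter(List<Long> rowCounts, List<Integer> keptOrdinals) {
    List<Long> all = firstRowIndexes(rowCounts);
    List<Long> kept = new ArrayList<>();
    for (int ordinal : keptOrdinals) {
      kept.add(all.get(ordinal));
    }
    return kept;
  }

  /** Incorrect order: filter first, so rows of dropped row groups are never counted. */
  static List<Long> filterThenOffsets(List<Long> rowCounts, List<Integer> keptOrdinals) {
    List<Long> keptCounts = new ArrayList<>();
    for (int ordinal : keptOrdinals) {
      keptCounts.add(rowCounts.get(ordinal));
    }
    return firstRowIndexes(keptCounts);
  }

  public static void main(String[] args) {
    List<Long> rowCounts = Arrays.asList(100L, 50L, 200L);
    List<Integer> kept = Arrays.asList(1, 2); // the filter drops row group 0
    System.out.println(offsetsThenFilter(rowCounts, kept)); // [100, 150]
    System.out.println(filterThenOffsets(rowCounts, kept)); // [0, 50]
  }
}
```

With row groups of 100, 50 and 200 rows and a filter that drops the first one, the surviving row groups start at file positions 100 and 150. Computing the running sum only over the survivors yields 0 and 50, which is why the offsets have to be generated before the filter is applied.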





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-15 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492974#comment-17492974
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r807495498



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map generateRowGroupOffsets(FileMetaData metaData) {
+    Map rowGroupOrdinalToRowIdx = new HashMap<>();
+    List rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
   done.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491211#comment-17491211
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036746521


   We need more tests to cover old parquet data that doesn't have a column index. 



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491209#comment-17491209
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805065376



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -340,12 +342,21 @@ public static void write(ParquetWriter.Builder builder, List use
     return users;
   }
 
-  public static List readUsers(ParquetReader.Builder builder) throws IOException {

Review comment:
   Removing a public method could cause an incompatibility issue. 





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491210#comment-17491210
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805065458



##
File path: 
parquet-hadoop/src/test/java/org/apache/parquet/filter2/recordlevel/PhoneBookWriter.java
##
@@ -315,7 +317,7 @@ public static void write(ParquetWriter.Builder builder, List use
     }
   }
 
-  private static ParquetReader createReader(Path file, Filter filter) throws IOException {
+  public static ParquetReader createReader(Path file, Filter filter) throws IOException {

Review comment:
   Why make it public?





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491207#comment-17491207
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805062241



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;
+    Map<RowGroup, Long> rowGroupToRowIndexOffsetMap;
+    public FileMetaDataAndRowGroupOffsetInfo(FileMetaData fileMetadata, Map<RowGroup, Long> rowGroupToRowIndexOffsetMap) {
+      this.fileMetadata = fileMetadata;
+      this.rowGroupToRowIndexOffsetMap = rowGroupToRowIndexOffsetMap;
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
       final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter,
       final int combinedFooterLength) throws IOException {
 
     final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
     final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
 
-    FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor() {
+    FileMetaDataAndRowGroupOffsetInfo fileMetaDataAndRowGroupInfo = filter.accept(new MetadataFilterVisitor() {

Review comment:
   I don't see the need to change each individual implementation. Since 
generateRowGroupOffsets() only needs fileMetadata, can you just call 
generateRowGroupOffsets() at the end? 
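
A sketch of the computation being discussed, assuming the offsets depend only on the per-row-group row counts and that the metadata still lists every row group in file order when the method is called. `RowGroupOffsets` and the plain list of counts are illustrative stand-ins for `FileMetaData.getRow_groups()` and `RowGroup.getNum_rows()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowGroupOffsets {
  /**
   * Returns, for each row group, the file-level index of its first row.
   * rowCounts.get(i) is the number of rows in row group i.
   */
  public static List<Long> startingRowIndexes(List<Long> rowCounts) {
    List<Long> offsets = new ArrayList<>(rowCounts.size());
    long rowIdxSum = 0L;
    for (long count : rowCounts) {
      offsets.add(rowIdxSum); // index of the first row in this group
      rowIdxSum += count;     // rows covered by all groups so far
    }
    return offsets;
  }

  public static void main(String[] args) {
    // Three row groups holding 3, 5 and 2 rows.
    System.out.println(startingRowIndexes(Arrays.asList(3L, 5L, 2L))); // prints [0, 3, 8]
  }
}
```

Because the result is a pure function of the row counts, it can be computed in one place once the metadata is available, which is the point of the review comment.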





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491206#comment-17491206
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805060344



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##
@@ -1400,34 +1400,67 @@ public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilte
     return readParquetMetadata(from, filter, null, false, 0);
   }
 
+  private Map<RowGroup, Long> generateRowGroupOffsets(FileMetaData metaData) {
+    Map<RowGroup, Long> rowGroupOrdinalToRowIdx = new HashMap<>();
+    List<RowGroup> rowGroups = metaData.getRow_groups();
+    if (rowGroups != null) {
+      long rowIdxSum = 0;
+      for (int i = 0; i < rowGroups.size(); i++) {
+        rowGroupOrdinalToRowIdx.put(rowGroups.get(i), rowIdxSum);
+        rowIdxSum += rowGroups.get(i).getNum_rows();
+      }
+    }
+    return rowGroupOrdinalToRowIdx;
+  }
+
+  /**
+   * A container for [[FileMetaData]] and [[RowGroup]] to ROW_INDEX offset map.
+   */
+  private class FileMetaDataAndRowGroupOffsetInfo {
+    FileMetaData fileMetadata;

Review comment:
   Please declare this field final.
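
A small sketch of what final fields buy in a holder class like this one. The names are hypothetical, and String stands in for both `FileMetaData` and the `RowGroup` key type so the example is self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class MetaDataAndOffsets {
  // final: assigned exactly once in the constructor, never reassigned afterwards.
  private final String fileMetadata;
  private final Map<String, Long> rowGroupOffsets;

  public MetaDataAndOffsets(String fileMetadata, Map<String, Long> rowGroupOffsets) {
    this.fileMetadata = fileMetadata;
    // Defensive copy plus read-only view, so the holder cannot change after construction.
    this.rowGroupOffsets = Collections.unmodifiableMap(new HashMap<>(rowGroupOffsets));
  }

  public String getFileMetadata() { return fileMetadata; }

  public Map<String, Long> getRowGroupOffsets() { return rowGroupOffsets; }

  public static void main(String[] args) {
    Map<String, Long> offsets = new HashMap<>();
    offsets.put("rowGroup0", 0L);
    MetaDataAndOffsets info = new MetaDataAndOffsets("footer", offsets);
    offsets.put("rowGroup1", 3L); // later changes to the caller's map are not visible
    System.out.println(info.getRowGroupOffsets().size()); // prints 1
  }
}
```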





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491198#comment-17491198
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805053725



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {
+      throw new RowIndexFetchedWithoutProcessingRowException("row index can be fetched only after processing a row");
+    }
+    if (rowIndexWithinFileIterator == null) {
+      throw new RowIndexNotSupportedException("underlying page read store implementation" +
+        " doesn't support row index generation");
+    }
+    return currentRowIndex;
+  }
+
+  /**
+   * Resets the row index iterator based on the current processed row group.
+   */
+  private void resetRowIndexIterator(PageReadStore pages) {
+    Optional<Long> rowIndexOffsetForCurrentRowGroup = pages.getRowIndexOffset();

Review comment:
   This name is too long; please shorten it.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491196#comment-17491196
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805052859



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -265,4 +273,51 @@ public boolean nextKeyValue() throws IOException, InterruptedException {
     return Collections.unmodifiableMap(setMultiMap);
   }
 
+  /**
+   * Returns the ROW_INDEX of the current row.
+   */
+  public long getCurrentRowIndex() {
+    if (current == 0L) {

Review comment:
   What is the reason for not returning -1?
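
The two options being weighed here can be sketched side by side. This is a hypothetical cursor, not InternalParquetRecordReader, and it uses the standard IllegalStateException in place of the PR's custom exception type.

```java
final class RowIndexCursor {
  private long rowsRead = 0L;
  private long currentRowIndex = -1L; // -1 until the first row is processed

  /** Simulates processing the next row. */
  public void advance() {
    currentRowIndex = rowsRead++;
  }

  /** Sentinel style: callers must remember to check for -1. */
  public long currentRowIndexOrMinusOne() {
    return currentRowIndex;
  }

  /** Exception style: misuse fails loudly at the call site. */
  public long currentRowIndexOrThrow() {
    if (rowsRead == 0L) {
      throw new IllegalStateException("row index can be fetched only after processing a row");
    }
    return currentRowIndex;
  }

  public static void main(String[] args) {
    RowIndexCursor cursor = new RowIndexCursor();
    System.out.println(cursor.currentRowIndexOrMinusOne()); // prints -1
    cursor.advance();
    System.out.println(cursor.currentRowIndexOrThrow());    // prints 0
  }
}
```

The sentinel keeps the API free of exceptions but lets a forgotten check silently propagate -1; the exception makes the precondition explicit.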





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491194#comment-17491194
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805052003



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
##
@@ -69,6 +71,8 @@
   private long current = 0;
   private int currentBlock = -1;
   private ParquetFileReader reader;
+  private long currentRowIndex = -1L;
+  private PrimitiveIterator.OfLong rowIndexWithinFileIterator;

Review comment:
   This name is too long; please shorten it.





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491180#comment-17491180
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805044955



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -265,6 +268,11 @@ public long getRowCount() {
     return rowCount;
   }
 
+  @Override
+  public Optional<Long> getRowIndexOffset() {
+    return Optional.of(rowIndexOffset);

Review comment:
   If the constructor caller cannot supply a valid rowIndexOffset, I guess 
we need to provide a way to return an empty Optional. 
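
One possible shape for this, sketched under the assumption that "no valid offset" is represented by a null `Long` inside the store. `PageStoreSketch` is a hypothetical class, not ColumnChunkPageReadStore.

```java
import java.util.Optional;

final class PageStoreSketch {
  private final long rowCount;
  private final Long rowIndexOffset; // null when the caller has no valid offset

  public PageStoreSketch(long rowCount, Long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  public long getRowCount() {
    return rowCount;
  }

  /** Empty when the offset is unknown, instead of wrapping a meaningless value. */
  public Optional<Long> getRowIndexOffset() {
    return Optional.ofNullable(rowIndexOffset);
  }

  public static void main(String[] args) {
    System.out.println(new PageStoreSketch(10L, 42L).getRowIndexOffset().isPresent());  // prints true
    System.out.println(new PageStoreSketch(10L, null).getRowIndexOffset().isPresent()); // prints false
  }
}
```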





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491174#comment-17491174
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r805042201



##
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java
##
@@ -248,15 +248,18 @@ public DictionaryPage readDictionaryPage() {
 
   private final Map readers = new HashMap();
   private final long rowCount;
+  private final long rowIndexOffset;
   private final RowRanges rowRanges;
 
-  public ColumnChunkPageReadStore(long rowCount) {
+  public ColumnChunkPageReadStore(long rowCount, long rowIndexOffset) {

Review comment:
   Since this constructor is public, changing it could break callers if we don't 
keep the original signature. 
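
A sketch of keeping the original constructor next to the new one through constructor chaining. `ReadStoreSketch` and the `UNKNOWN_OFFSET` sentinel are assumptions made for illustration, not the names used in the PR.

```java
final class ReadStoreSketch {
  private static final long UNKNOWN_OFFSET = -1L; // hypothetical "not provided" marker

  private final long rowCount;
  private final long rowIndexOffset;

  /** Original signature, kept so existing callers are unaffected. */
  public ReadStoreSketch(long rowCount) {
    this(rowCount, UNKNOWN_OFFSET);
  }

  /** New signature carrying the row index offset of the row group. */
  public ReadStoreSketch(long rowCount, long rowIndexOffset) {
    this.rowCount = rowCount;
    this.rowIndexOffset = rowIndexOffset;
  }

  public long getRowCount() {
    return rowCount;
  }

  public boolean hasRowIndexOffset() {
    return rowIndexOffset >= 0L;
  }

  public static void main(String[] args) {
    System.out.println(new ReadStoreSketch(100L).hasRowIndexOffset());       // prints false
    System.out.println(new ReadStoreSketch(100L, 500L).hasRowIndexOffset()); // prints true
  }
}
```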





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491072#comment-17491072
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036460871


   > Can you squash the commits to make the review easier?
   
   Done.



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-11 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17491066#comment-17491066
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

shangxinli commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1036445953


   Can you squash the commits to make the review easier? 



[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-10 Thread Prakhar Jain (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490650#comment-17490650
 ] 

Prakhar Jain commented on PARQUET-2117:
---

[~sha...@uber.com] [~gszadovszky] Could you please 
[review the PR|https://github.com/apache/parquet-mr/pull/945] and provide your feedback? 
Thanks!


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488274#comment-17488274
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#issuecomment-1031713582


   @shangxinli @gszadovszky Please review the changes when you get a chance. 
Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes API’s to read 
> parquet file in columnar fashion or record-by-record.
> It will be great to extend them to also support rowPosition API which can 
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful to create an index (e.g. B+ tree) over a parquet file/parquet 
> table (e.g.  Spark/Hive).
> There are multiple projects in the parquet eco-system which can benefit from 
> such a functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already as it relies on low level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
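
One way to picture the row position described above is as the number of rows in all preceding row groups plus the record's index inside its own row group. The following is a minimal sketch under that assumption; `RowPositionSketch`, `rowGroupOffsets`, and `rowPosition` are hypothetical names for illustration and are not part of parquet-mr.

```java
public class RowPositionSketch {

  /** Returns, for each row group, the file-level position of its first row. */
  static long[] rowGroupOffsets(long[] rowCounts) {
    long[] offsets = new long[rowCounts.length];
    long running = 0;
    for (int i = 0; i < rowCounts.length; i++) {
      offsets[i] = running;      // rows that precede this row group
      running += rowCounts[i];
    }
    return offsets;
  }

  /** File-level position of a record, given its row group and its index inside that group. */
  static long rowPosition(long[] offsets, int rowGroup, long indexInGroup) {
    return offsets[rowGroup] + indexInGroup;
  }

  public static void main(String[] args) {
    // Three hypothetical row groups holding 100, 250 and 75 rows.
    long[] offsets = rowGroupOffsets(new long[] {100, 250, 75});
    // Record 10 of the third row group: 100 + 250 + 10.
    System.out.println(rowPosition(offsets, 2, 10)); // prints 360
  }
}
```

Because the position depends only on row counts, it stays stable for a given file regardless of which row groups or pages a filter skips, which is what makes it usable as a row identifier within that file.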




[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-07 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488273#comment-17488273
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 commented on a change in pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945#discussion_r800879793



##
File path: 
parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java
##
@@ -43,6 +43,14 @@
*/
   long getRowCount();
 
+  /**
+   *
+   * @return the row index offset of this row group.
+   */
+  default Optional<Long> getRowIndexOffset() {

Review comment:
   @shangxinli I added this new method to the interface, which causes 
japicmp-maven-plugin to fail the build with METHOD_ADDED_TO_INTERFACE.
   
   Giving the method a default implementation seems to resolve the 
issue. Please share your feedback on this approach, or suggest an alternative.
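
The compatibility argument behind the default method can be sketched in isolation. In the example below, `RowGroupStore`, `LegacyStore`, and `OffsetAwareStore` are hypothetical stand-ins, not the real parquet-mr types; the sketch only assumes that existing implementers were written before the new method existed.

```java
import java.util.Optional;

public class DefaultMethodSketch {

  /** Hypothetical stand-in for an interface that gains a new method. */
  interface RowGroupStore {
    long getRowCount();

    /** New method: the default keeps existing implementers compiling and linking. */
    default Optional<Long> getRowIndexOffset() {
      return Optional.empty();
    }
  }

  /** Written before the new method existed; needs no changes. */
  static final class LegacyStore implements RowGroupStore {
    @Override
    public long getRowCount() {
      return 10;
    }
  }

  /** Updated implementer that knows where its row group starts in the file. */
  static final class OffsetAwareStore implements RowGroupStore {
    private final long offset;

    OffsetAwareStore(long offset) {
      this.offset = offset;
    }

    @Override
    public long getRowCount() {
      return 10;
    }

    @Override
    public Optional<Long> getRowIndexOffset() {
      return Optional.of(offset);
    }
  }

  /** Callers must handle the "unknown" case explicitly. */
  static String describe(RowGroupStore store) {
    return store.getRowIndexOffset().map(o -> "offset=" + o).orElse("offset unknown");
  }

  public static void main(String[] args) {
    System.out.println(describe(new LegacyStore()));         // prints offset unknown
    System.out.println(describe(new OffsetAwareStore(350))); // prints offset=350
  }
}
```

The trade-off is that callers cannot assume an offset is always present, so code that needs a row position has to decide what to do when the store returns an empty `Optional`.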







[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-04 Thread ASF GitHub Bot (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487381#comment-17487381
 ] 

ASF GitHub Bot commented on PARQUET-2117:
-

prakharjain09 opened a new pull request #945:
URL: https://github.com/apache/parquet-mr/pull/945


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-2117
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   Extended all the ColumnIndexFiltering and BloomFiltering tests to also 
validate the "row index". This adds unit test coverage for the following 
scenarios for this feature: Parquet V1/V2, with encryption on/off, with 
no-filter/simple-filter/column-index-filter/bloom-filter 
   
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and classes in the PR contain Javadoc that 
explains what they do
   





[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-03 Thread Prakhar Jain (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486806#comment-17486806
 ] 

Prakhar Jain commented on PARQUET-2117:
---

 [~sha...@uber.com] Yes, I am working on this. I will share the PR soon.


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-02 Thread Xinli Shang (Jira)



[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485949#comment-17485949
 ] 

Xinli Shang commented on PARQUET-2117:
--

Thanks for opening this Jira! Looking forward to the PR.

