[jira] [Commented] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

ASF GitHub Bot (Jira) Thu, 02 Mar 2023 04:36:14 -0800


    [ https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17695692#comment-17695692 ]


ASF GitHub Bot commented on PARQUET-2252:
-----------------------------------------

zhongyujiang commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1123028615


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException {
     }
 
     RowRanges rowRanges = getRowRanges(blockIndex);
+    return readFilteredRowGroup(blockIndex, rowRanges);
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip specific pages based on the
+   * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row
+   * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @param rowRanges the row ranges to be read from the requested block
+   * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null
+   */
+  public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException {
+    if (blockIndex < 0 || blockIndex >= blocks.size()) {
+      throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " +
+        "[%s, %s]", blockIndex, 0, blocks.size() - 1));
+    }
+
+    if (Objects.isNull(rowRanges)) {
+      throw new IllegalArgumentException("RowRanges must not be null");
+    }
+
+    BlockMetaData block = blocks.get(blockIndex);
+    if (block.getRowCount() == 0L) {
+      throw new ParquetEmptyBlockException("Illegal row group of 0 rows");

Review Comment:
   I checked PARQUET-2291. It seems we only skip empty row groups when the 
reader is used as an iterator, right? We skip empty row groups in 
`readNextRowGroup()`, but not when the user passes in a `blockIndex`, and this 
newly introduced method also requires the user to pass in a `blockIndex`.
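
To make the distinction concrete, here is a minimal sketch of the two access paths. It is a toy model, not the real `ParquetFileReader`: the class `RowGroupAccessSketch`, its `rowCounts` list, and the method `readRowGroup(int)` are hypothetical names, and `IllegalStateException` stands in for `ParquetEmptyBlockException`.

```java
import java.util.List;

/** Toy stand-in for a file reader; hypothetical, NOT the parquet-mr API. */
final class RowGroupAccessSketch {
  // Row count of each row group, in file order (assumed model of the footer metadata).
  private final List<Long> rowCounts;
  private int currentBlock = 0;

  RowGroupAccessSketch(List<Long> rowCounts) {
    this.rowCounts = rowCounts;
  }

  /**
   * Iterator-style path: silently advances past empty row groups.
   * Returns the index of the row group that was read, or -1 when none are left.
   */
  int readNextRowGroup() {
    while (currentBlock < rowCounts.size() && rowCounts.get(currentBlock) == 0L) {
      currentBlock++; // empty row group: skip it, the caller never sees it
    }
    if (currentBlock >= rowCounts.size()) {
      return -1;
    }
    return currentBlock++;
  }

  /**
   * Index-style path: the caller picked the block explicitly, so an empty
   * row group is reported rather than skipped.
   */
  int readRowGroup(int blockIndex) {
    if (blockIndex < 0 || blockIndex >= rowCounts.size()) {
      throw new IllegalArgumentException("Invalid block index " + blockIndex);
    }
    if (rowCounts.get(blockIndex) == 0L) {
      // Stand-in for ParquetEmptyBlockException in the diff above.
      throw new IllegalStateException("Illegal row group of 0 rows");
    }
    return blockIndex;
  }
}
```

With row counts `[0, 5, 0, 3]`, repeated calls to `readNextRowGroup()` in this model visit blocks 1 and 3 only, while `readRowGroup(0)` throws because the caller asked for that specific block.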
   





> Make some methods public to allow external projects to implement page skipping
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-2252
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2252
>             Project: Parquet
>          Issue Type: New Feature
>            Reporter: Yujiang Zhong
>            Priority: Major
>
> Iceberg hopes to implement a column index filter based on Iceberg's own 
> expressions. To do so, we would like to use some of the methods in the 
> Parquet repo, for example methods in `RowRanges` and `IndexIterator`. 
> However, these are currently not public, so for now we can only rely on 
> reflection to use them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
