[
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696626#comment-17696626
]
ASF GitHub Bot commented on PARQUET-2252:
-----------------------------------------
rdblue commented on code in PR #1038:
URL: https://github.com/apache/parquet-mr/pull/1038#discussion_r1125738673
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1011,6 +1012,35 @@ public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException {
}
RowRanges rowRanges = getRowRanges(blockIndex);
+ return readFilteredRowGroup(blockIndex, rowRanges);
+ }
+
+ /**
+ * Reads all the columns requested from the specified row group. It may skip specific pages based on the
+ * {@code rowRanges} passed in. As the rows are not aligned among the pages of the different columns row
+ * synchronization might be required. See the documentation of the class SynchronizingColumnReader for details.
+ *
+ * @param blockIndex the index of the requested block
+ * @param rowRanges the row ranges to be read from the requested block
+ * @return the PageReadStore which can provide PageReaders for each column or null if there are no rows in this block
+ * @throws IOException if an error occurs while reading
+ * @throws IllegalArgumentException if the {@code blockIndex} is invalid or the {@code rowRanges} is null
+ */
+ public ColumnChunkPageReadStore readFilteredRowGroup(int blockIndex, RowRanges rowRanges) throws IOException {
+ if (blockIndex < 0 || blockIndex >= blocks.size()) {
+ throw new IllegalArgumentException(String.format("Invalid block index %s, the valid block index range are: " +
+ "[%s, %s]", blockIndex, 0, blocks.size() - 1));
+ }
+
+ if (Objects.isNull(rowRanges)) {
+ throw new IllegalArgumentException("RowRanges must not be null");
+ }
+
+ BlockMetaData block = blocks.get(blockIndex);
+ if (block.getRowCount() == 0L) {
+ throw new ParquetEmptyBlockException("Illegal row group of 0 rows");
Review Comment:
I don't see why this would throw an exception. This method is intended to
allow building an external iterator. I don't think anyone would ever want to
fail if there were an empty row group, even if the reader thinks it shouldn't
have been written. I think this should return null.
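
To illustrate the suggestion, here is a minimal sketch of how an external iterator could consume a null return. `Block`, `PageStore`, and `countRows` are hypothetical stand-ins invented for this example, not the real `BlockMetaData`, `ColumnChunkPageReadStore`, or any other Parquet API, and the two-argument signature is simplified on purpose:

```java
import java.util.List;

public class EmptyRowGroupSketch {
    // Hypothetical stand-in for row group metadata (not the real BlockMetaData).
    static final class Block {
        final long rowCount;
        Block(long rowCount) { this.rowCount = rowCount; }
    }

    // Hypothetical stand-in for the page store returned for one row group.
    static final class PageStore {
        final int blockIndex;
        final long rowCount;
        PageStore(int blockIndex, long rowCount) {
            this.blockIndex = blockIndex;
            this.rowCount = rowCount;
        }
    }

    // Simplified version of the method under review: an empty row group
    // yields null instead of throwing.
    static PageStore readFilteredRowGroup(List<Block> blocks, int blockIndex) {
        if (blockIndex < 0 || blockIndex >= blocks.size()) {
            throw new IllegalArgumentException("Invalid block index " + blockIndex);
        }
        Block block = blocks.get(blockIndex);
        if (block.rowCount == 0L) {
            return null; // nothing to read, but not an error for the caller
        }
        return new PageStore(blockIndex, block.rowCount);
    }

    // An external iterator can skip empty row groups with a plain null check.
    static long countRows(List<Block> blocks) {
        long total = 0;
        for (int i = 0; i < blocks.size(); i++) {
            PageStore store = readFilteredRowGroup(blocks, i);
            if (store == null) {
                continue; // empty row group: move on to the next one
            }
            total += store.rowCount;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Block> blocks = List.of(new Block(3), new Block(0), new Block(5));
        System.out.println(countRows(blocks)); // prints 8
    }
}
```

With an exception instead of the null return, the loop in `countRows` would need a try/catch around every call just to tolerate a file that happens to contain an empty row group.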
> Make some methods public to allow external projects to implement page skipping
> ------------------------------------------------------------------------------
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
> Issue Type: New Feature
> Reporter: Yujiang Zhong
> Priority: Major
>
> Iceberg hopes to implement a column index filter based on Iceberg's own
> expressions. To do that, we would like to use some of the methods in the
> Parquet repo, for example methods in `RowRanges` and `IndexIterator`, but
> these are currently not public, so for now we can only rely on reflection
> to use them.