[jira] [Commented] (PARQUET-1982) Allow random access to row groups in ParquetFileReader

ASF GitHub Bot (Jira) Tue, 23 Feb 2021 03:31:07 -0800


    [ 
https://issues.apache.org/jira/browse/PARQUET-1982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289026#comment-17289026
 ]


ASF GitHub Bot commented on PARQUET-1982:
-----------------------------------------

fschmalzel commented on a change in pull request #871:
URL: https://github.com/apache/parquet-mr/pull/871#discussion_r580961322



##########
File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java
##########
@@ -920,22 +946,52 @@ public PageReadStore readNextRowGroup() throws 
IOException {
       }
     }
     // actually read all the chunks
-    ChunkListBuilder builder = new ChunkListBuilder();
+    ChunkListBuilder builder = new ChunkListBuilder(block.getRowCount());
     for (ConsecutivePartList consecutiveChunks : allParts) {
       consecutiveChunks.readAll(f, builder);
     }
     for (Chunk chunk : builder.build()) {
-      readChunkPages(chunk, block);
+      readChunkPages(chunk, block, rowGroup);
     }
 
-    // avoid re-reading bytes the dictionary reader is used after this call
-    if (nextDictionaryReader != null) {
-      nextDictionaryReader.setRowGroup(currentRowGroup);
+    return rowGroup;
+  }
+
+  /**
+   * Reads all the columns requested from the specified row group. It may skip 
specific pages based on the column
+   * indexes according to the actual filter. As the rows are not aligned among 
the pages of the different columns row
+   * synchronization might be required. See the documentation of the class 
SynchronizingColumnReader for details.
+   *
+   * @param blockIndex the index of the requested block
+   * @return the PageReadStore which can provide PageReaders for each column 
or null if there are no rows in this block
+   * @throws IOException if an error occurs while reading
+   */
+  public PageReadStore readFilteredRowGroup(int blockIndex) throws IOException 
{

Review comment:
       I think using BlockMetaData is better as it is similar to the existing 
getDictionaryReader(BlockMetaData). For the filtered row group, we need the 
index to access the RowRanges and ColumnIndexStore.
   
   
https://github.com/apache/parquet-mr/pull/871/commits/453a6cc9ef6bda9bd963fab05e59ab6f51389dfa#diff-8da24c84aef62e6e836d073938f7843d289785baaeddf446f3afeae6d4ef4b10R983
   
   
https://github.com/apache/parquet-mr/pull/871/commits/453a6cc9ef6bda9bd963fab05e59ab6f51389dfa#diff-8da24c84aef62e6e836d073938f7843d289785baaeddf446f3afeae6d4ef4b10R994
   
   What do you think is the better approach?
   Accept BlockMetaData and find the index in the list.
   -or-
   Change the other method signatures to also use indexes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Allow random access to row groups in ParquetFileReader
> ------------------------------------------------------
>
>                 Key: PARQUET-1982
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1982
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Felix Schmalzel
>            Priority: Minor
>              Labels: parquetReader, random-access
>
> The used SeekableInputStream and all other components of the 
> ParquetFileReader already support random access and row groups should be 
> independent of each other.
> This would allow reusing the opened InputStream when you want to go back a 
> row group. It also makes accessing specific row groups a lot easier.
> I've already developed a patch that would enable this functionality. I will 
> link the merge request in the next few days.
> Is there a related ticket that i have overlooked?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (PARQUET-1982) Allow random access to row groups in ParquetFileReader

Reply via email to