[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

Xinli Shang (Jira) Mon, 19 Oct 2020 09:01:41 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216849#comment-17216849
 ]


Xinli Shang commented on PARQUET-1927:
--------------------------------------

[~gszadovszky], the Iceberg Parquet reader iterator relies on the check 
'valuesRead < totalValues'. When integrating ColumnIndex, we replace 
readNextRowGroup() with readNextFilteredRowGroup(). Because 
readNextFilteredRowGroup() skips some records, we change the check to 
'valuesRead + skippedValues < totalValues'. The skippedValues is calculated 
as 'blockRowCount - counts_Returned_from_readNextFilteredRowGroup'. This works 
great. But when the whole row group is skipped, readNextFilteredRowGroup() 
advances to the next row group internally without Iceberg's knowledge. Hence 
Iceberg doesn't know how to calculate the skippedValues. 

So if readNextFilteredRowGroup() could return how many records it skipped, or 
report the index of the row group that the returned pages come from, Iceberg 
could calculate the skippedValues. 
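
To make the request concrete, here is a minimal, self-contained sketch of the 
bookkeeping. It is a toy model, not parquet-mr or Iceberg code: the RowGroup, 
FilteredRead, readNextFiltered() and drain() names are all hypothetical, and 
the assumption is that the reader reports the rows it skipped (including whole 
row groups it passed over) together with each result.

```java
import java.util.Iterator;
import java.util.List;

/** Illustrative model only; none of these types exist in parquet-mr or Iceberg. */
final class SkipAccountingSketch {

    /** One row group: rows in the block, and rows left after ColumnIndex filtering. */
    static final class RowGroup {
        final long blockRowCount;
        final long matchingRows;

        RowGroup(long blockRowCount, long matchingRows) {
            this.blockRowCount = blockRowCount;
            this.matchingRows = matchingRows;
        }
    }

    /** Hypothetical return value: rows handed back plus rows skipped since the previous call. */
    static final class FilteredRead {
        final long rowCount;
        final long skippedRows;

        FilteredRead(long rowCount, long skippedRows) {
            this.rowCount = rowCount;
            this.skippedRows = skippedRows;
        }
    }

    /** Stand-in for a filtered read that also reports what it skipped (assumed behavior). */
    static FilteredRead readNextFiltered(Iterator<RowGroup> groups) {
        long skipped = 0;
        while (groups.hasNext()) {
            RowGroup group = groups.next();
            skipped += group.blockRowCount - group.matchingRows;
            if (group.matchingRows > 0) {
                return new FilteredRead(group.matchingRows, skipped);
            }
            // Whole row group filtered out: keep advancing, but remember its rows.
        }
        // End of file: nothing to return, only trailing skipped rows to report.
        return new FilteredRead(0, skipped);
    }

    /** Runs the caller-side loop and returns {valuesRead, skippedValues}. */
    static long[] drain(List<RowGroup> file) {
        long totalValues = 0;
        for (RowGroup group : file) {
            totalValues += group.blockRowCount;
        }
        long valuesRead = 0;
        long skippedValues = 0;
        Iterator<RowGroup> groups = file.iterator();
        // Same shape as the check quoted above.
        while (valuesRead + skippedValues < totalValues) {
            FilteredRead read = readNextFiltered(groups);
            valuesRead += read.rowCount;
            skippedValues += read.skippedRows;
        }
        return new long[] {valuesRead, skippedValues};
    }
}
```

With row groups of (100 rows, 40 matching), (50, 0), (30, 30) and (20, 0), 
drain() ends with valuesRead = 70 and skippedValues = 130, so the check 
terminates even though two row groups were skipped entirely. 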

> ColumnIndex should provide number of records skipped 
> -----------------------------------------------------
>
>                 Key: PARQUET-1927
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1927
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.11.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found that we need to know from 
> Parquet how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row group 
> without telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, Parquet records are read with an iterator. Its hasNext() 
> contains the following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]).
>  
> So without knowing the number of skipped values, it is hard to decide what 
> hasNext() should return. 
>  
> Currently, we can work around this by using a flag. When 
> readNextFilteredRowGroup() returns null, we consider the whole file done. 
> Then hasNext() just returns false. 
>  
>  
>  
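
The flag workaround described in the quoted issue could look roughly like the 
following. This is a sketch under assumptions: FlagWorkaroundIterator and its 
private readNextFiltered() helper are hypothetical stand-ins, with each long[] 
modelling the rows of one row group that survive filtering, and null modelling 
the end of the file.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Illustrative only; not the Iceberg reader and not the parquet-mr API. */
final class FlagWorkaroundIterator implements Iterator<Long> {
    private final Iterator<long[]> rowGroups; // surviving rows per row group
    private long[] current = new long[0];
    private int position = 0;
    private boolean done = false; // the flag: set once the reader returns null

    FlagWorkaroundIterator(List<long[]> filteredRowGroups) {
        this.rowGroups = filteredRowGroups.iterator();
    }

    /** Stand-in for the filtered read: silently passes empty groups, null at end of file. */
    private long[] readNextFiltered() {
        while (rowGroups.hasNext()) {
            long[] group = rowGroups.next();
            if (group.length > 0) {
                return group;
            }
        }
        return null;
    }

    @Override
    public boolean hasNext() {
        if (done) {
            return false;
        }
        if (position < current.length) {
            return true;
        }
        long[] next = readNextFiltered();
        if (next == null) {
            // No counting of skipped values needed: null means the file is exhausted.
            done = true;
            return false;
        }
        current = next;
        position = 0;
        return true;
    }

    @Override
    public Long next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current[position++];
    }
}
```

The flag sidesteps the arithmetic entirely, at the cost of the caller never 
learning how many records were skipped. 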



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
