[ 
https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754846#comment-17754846
 ] 

ASF GitHub Bot commented on PARQUET-1364:
-----------------------------------------

zhaochengzhch commented on code in PR #507:
URL: https://github.com/apache/parquet-mr/pull/507#discussion_r1295348134


##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -84,6 +84,10 @@ private void definitionLevel(int definitionLevel) {
 
   private void repetitionLevel(int repetitionLevel) {
     repetitionLevelColumn.writeInteger(repetitionLevel);
+    assert pageRowCount == 0 ? repetitionLevel == 0 : true : "Every page shall start on record boundaries";

Review Comment:
   What is the reasoning behind adding this check here? I have encountered a 
situation where the value count is 0 but the repetition level is not 0. Is 
that situation itself normal? Why does this check need to be added after the 
column index change?





> Column Indexes: Invalid row indexes for pages starting with nulls
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1364
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1364
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation for managing row indexes for the pages is not 
> reliable. There is logic in 
> [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
>  that caches null values and flushes them just *before* opening a new group. 
> This logic might cause pages to start with these cached nulls, which are not 
> correctly counted in the written rows, so the rowIndexes are incorrect. This 
> does not cause any issues if all the pages are read continuously, but it is a 
> huge problem for column index based filtering.
> The implementation described above is really complicated, and we would not 
> like to redesign it because of the mentioned issue. It is easier to simply 
> count the {{0}} repetition levels as record boundaries at the column writer 
> level.
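
The approach in the description (treating each repetition level of 0 as the start of a new record, counted at the column writer) can be illustrated with a minimal sketch. This is not the parquet-mr implementation: the class RecordBoundaryCounter, its fields, and the value-count page limit are hypothetical names and assumptions chosen for illustration. It only shows how counting 0 repetition levels yields a first row index per page, and why a page that starts with a non-zero repetition level would trip a check like the one in the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the real ColumnWriterBase.
final class RecordBoundaryCounter {
    private final int pageValueLimit;                 // assumed page size limit, in values
    private final List<Long> firstRowIndexes = new ArrayList<>();
    private long totalRowCount = 0;                   // records started in the column so far
    private int pageRowCount = 0;                     // records started in the current page
    private int pageValueCount = 0;                   // values written to the current page

    RecordBoundaryCounter(int pageValueLimit) {
        this.pageValueLimit = pageValueLimit;
    }

    // Registers one value (null or not) by its repetition level.
    void write(int repetitionLevel) {
        if (repetitionLevel == 0) {
            // A 0 repetition level marks a record boundary, so this is the
            // only point where the sketch allows a page to be closed.
            if (pageValueCount >= pageValueLimit) {
                endPage();
            }
            if (pageRowCount == 0) {
                // First record of a new page: remember its row index.
                firstRowIndexes.add(totalRowCount);
            }
            pageRowCount++;
            totalRowCount++;
        } else if (pageRowCount == 0) {
            // Mirrors the intent of the assertion in the diff.
            throw new IllegalStateException("Every page shall start on record boundaries");
        }
        pageValueCount++;
    }

    private void endPage() {
        pageRowCount = 0;
        pageValueCount = 0;
    }

    List<Long> firstRowIndexes() {
        return firstRowIndexes;
    }

    long totalRowCount() {
        return totalRowCount;
    }
}
```

With a limit of 3 values per page, feeding the repetition levels 0, 1, 1, 0, 0, 1 closes the first page only when the second record begins, so the recorded first row indexes are 0 and 1 and three rows are counted in total.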



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
