[ https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754846#comment-17754846 ]
ASF GitHub Bot commented on PARQUET-1364:
-----------------------------------------

zhaochengzhch commented on code in PR #507:
URL: https://github.com/apache/parquet-mr/pull/507#discussion_r1295348134


##########
parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java:
##########
@@ -84,6 +84,10 @@ private void definitionLevel(int definitionLevel) {
   private void repetitionLevel(int repetitionLevel) {
     repetitionLevelColumn.writeInteger(repetitionLevel);
+    assert pageRowCount == 0 ? repetitionLevel == 0 : true : "Every page shall start on record boundaries";

Review Comment:
   What is the reasoning behind adding this check here? I have encountered a case where the valueCount is 0 but the repetitionLevel is not 0. Is that case itself normal? Why is this check needed after columnindex?


> Column Indexes: Invalid row indexes for pages starting with nulls
> -----------------------------------------------------------------
>
>                 Key: PARQUET-1364
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1364
>             Project: Parquet
>          Issue Type: Sub-task
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation for writing and managing row indexes for the
> pages is not reliable. There is logic in
> [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
> which caches null values and flushes them just *before* opening a new group.
> This logic might cause pages to start with these cached nulls, which are not
> correctly counted in the written rows, so the rowIndexes are incorrect. It
> does not cause any issues if all the pages are read continuously, but it is a
> huge problem for column index based filtering.
> The implementation described above is really complicated and I would not like
> to redesign it because of the mentioned issue. It is easier to simply count
> the {{0}} repetition levels as record boundaries at the column writer level.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
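
For readers following the thread, the idea in the quoted description (count every repetition level of 0 as a record boundary at the column writer level) can be sketched roughly as below. This is a minimal illustration, not the parquet-mr code: the class `RowIndexSketch`, its methods, and the `maxValuesPerPage` threshold are hypothetical names invented for this example, and only `pageRowCount` and the error message are borrowed from the diff above. Because only the repetition level is inspected, a null written with repetition level 0 is counted as a row just like any other value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; this is not parquet-mr's ColumnWriterBase. */
class RowIndexSketch {
    private final int maxValuesPerPage;               // assumed page-size threshold
    private final List<Long> firstRowIndexes = new ArrayList<>();
    private long rowsInClosedPages = 0;               // rows in already flushed pages
    private int pageRowCount = 0;                     // rows started in the open page
    private int pageValueCount = 0;                   // values (nulls included) in the open page

    RowIndexSketch(int maxValuesPerPage) {
        this.maxValuesPerPage = maxValuesPerPage;
    }

    /** Records one value (null or not) by its repetition level only. */
    void write(int repetitionLevel) {
        if (repetitionLevel == 0) {
            // A repetition level of 0 starts a new record, so this is the only
            // point where the open page may be closed.
            if (pageValueCount >= maxValuesPerPage) {
                flushPage();
            }
            pageRowCount++;
        } else if (pageRowCount == 0) {
            // Same invariant as the assert in the diff, enforced unconditionally here.
            throw new IllegalStateException("Every page shall start on record boundaries");
        }
        pageValueCount++;
    }

    /** Closes the open page and remembers the index of its first row. */
    void flushPage() {
        if (pageValueCount == 0) {
            return;
        }
        firstRowIndexes.add(rowsInClosedPages);
        rowsInClosedPages += pageRowCount;
        pageRowCount = 0;
        pageValueCount = 0;
    }

    List<Long> firstRowIndexes() {
        return firstRowIndexes;
    }

    long totalRows() {
        return rowsInClosedPages;
    }

    public static void main(String[] args) {
        RowIndexSketch writer = new RowIndexSketch(3);
        for (int level : new int[] {0, 1, 1, 0, 0, 1}) {
            writer.write(level);
        }
        writer.flushPage();
        System.out.println(writer.firstRowIndexes() + " rows=" + writer.totalRows());
    }
}
```

In this sketch a page can only be closed when a new record begins, so the first value of every page has repetition level 0 and the stored first-row index always matches the number of records written before that page.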