Gabor Szadovszky created PARQUET-1364:
-----------------------------------------

             Summary: Column Indexes: Invalid row indexes for pages starting with nulls
                 Key: PARQUET-1364
                 URL: https://issues.apache.org/jira/browse/PARQUET-1364
             Project: Parquet
          Issue Type: Sub-task
            Reporter: Gabor Szadovszky
            Assignee: Gabor Szadovszky


The current implementation for managing row indexes while writing pages is 
not reliable. There is logic in 
[MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
 that caches null values and flushes them just *before* opening a new group. 
As a result, a page might start with these cached nulls, which are not 
correctly counted in the written rows, so the rowIndexes are incorrect. This does 
not cause any issues if all the pages are read continuously, but it is a huge 
problem for column index based filtering.
The implementation described above is really complicated, and I would not like to 
redesign it because of the mentioned issue. It is easier to simply count the {{0}} 
repetition levels as record boundaries at the column writer level.
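
A minimal sketch of the proposed counting, not the actual parquet-mr code: the class {{RowIndexSketch}}, its {{write}} method and the {{pageValueLimit}} parameter are hypothetical names used only for illustration. It assumes that every value, null or not, reaches the column writer together with its repetition level, and that a page is closed only at a record boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these names are not part of parquet-mr. */
class RowIndexSketch {
    // Assumed page size threshold, expressed in values for simplicity.
    private final int pageValueLimit;
    // First row index of every page started so far.
    private final List<Long> firstRowIndexes = new ArrayList<>();
    // Number of records started so far in this column chunk.
    private long rowCount = 0;
    // Number of values (including nulls) written to the current page.
    private int valuesInPage = 0;

    RowIndexSketch(int pageValueLimit) {
        this.pageValueLimit = pageValueLimit;
    }

    /** Called once per value, null or not, with that value's repetition level. */
    void write(int repetitionLevel) {
        if (repetitionLevel == 0) {
            // A repetition level of 0 marks the start of a new record.
            // Close the current page only here, so no record spans two pages.
            if (valuesInPage >= pageValueLimit) {
                valuesInPage = 0;
            }
            if (valuesInPage == 0) {
                // The new page starts with the record that is about to be counted.
                firstRowIndexes.add(rowCount);
            }
            rowCount++;
        }
        valuesInPage++;
    }

    List<Long> getFirstRowIndexes() {
        return firstRowIndexes;
    }

    long getRowCount() {
        return rowCount;
    }
}
```

Because the count depends only on the repetition levels seen by the column writer, nulls that were cached and flushed late are counted like any other value, regardless of when the upper layer emits them.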



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
