Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19328 )
Change subject: IMPALA-11780: Wrong FILE__POSITION values for multi row group Parquet files when page filtering is used
......................................................................

Patch Set 2:

(2 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/19328/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19328/2//COMMIT_MSG@20
PS2, Line 20: In the meantime it turned out FILE__POSITION was also not set correctly
: in the Parquet late materialization code, as
: BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
: in some code paths.
> Wonder why this way not caught by existing tests - there seems to be some v
That codepath only causes issues when page filtering is not used. That's why I'm also running the tests with PARQUET_READ_STATISTICS set to false. The new tests would definitely catch this.

http://gerrit.cloudera.org:8080/#/c/19328/2/testdata/data/customer_nested_multiblock_multipage.parquet
File testdata/data/customer_nested_multiblock_multipage.parquet:

http://gerrit.cloudera.org:8080/#/c/19328/2/testdata/data/customer_nested_multiblock_multipage.parquet@1
PS2, Line 1: PAR1
> +1
In this case it's also not too hard to create the table on the fly via Hive. The file positions shouldn't change, as we are writing the rows in order. One potential problem is that Hive might start ignoring the settings, in which case we wouldn't really test what we intend to test. But we can add some checks for that, e.g. verify that only one Parquet file is written, with 3 row groups and N pages.

I also like the idea of using a storage service for such files, but it would require us to reach a consensus and would need some additional work. How about opening a Jira for that, so future data files could be stored outside of the Impala repo?
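
To make the expectation about file positions concrete, here is a minimal Python sketch of what the tests should observe for a file with several row groups. It is only an illustration, not Impala's scanner code: the function name `file_positions` and its inputs (row counts per row group, and the row ranges that survive page filtering) are hypothetical.

```python
from itertools import accumulate


def file_positions(row_group_sizes, selected_ranges):
    """Yield the file-level position of every selected row.

    row_group_sizes: number of rows in each row group, in file order.
    selected_ranges: for each row group, a list of (first, last) inclusive
        row index ranges, relative to that row group, that are read.
    Both inputs are hypothetical and exist only for this illustration.
    """
    # File-level index of the first row of each row group: 0, n0, n0+n1, ...
    starts = [0] + list(accumulate(row_group_sizes))[:-1]
    for start, size, ranges in zip(starts, row_group_sizes, selected_ranges):
        for first, last in ranges:
            if not (0 <= first <= last < size):
                raise ValueError("row range outside of its row group")
            for row in range(first, last + 1):
                # The position is relative to the file, not to the row group
                # or to the rows that were actually materialized.
                yield start + row


# Three row groups with 3, 4 and 2 rows; some rows are skipped in groups 2 and 3.
print(list(file_positions([3, 4, 2], [[(0, 2)], [(1, 2)], [(1, 1)]])))
```

The point of the sketch is that skipped rows still advance the position, so the values depend only on the order in which the rows were written.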
--
To view, visit http://gerrit.cloudera.org:8080/19328
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Gerrit-Change-Number: 19328
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Dec 2022 12:59:09 +0000
Gerrit-HasComments: Yes