Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19328 )

Change subject: IMPALA-11780: Wrong FILE__POSITION values for multi row group 
Parquet files when page filtering is used
......................................................................


Patch Set 2:

(2 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/19328/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19328/2//COMMIT_MSG@20
PS2, Line 20: In the meantime it turned out FILE__POSITION was also not set correctly
            : in the Parquet late materialization code, as
            : BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
            : in some code paths.
> Wonder why this way not caught by existing tests - there seems to be some v
That code path only causes issues when page filtering is not used, which is 
why I'm also running the tests with PARQUET_READ_STATISTICS set to false. The 
new tests would definitely catch this.
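
To illustrate the idea of running the same queries under both settings, here 
is a rough sketch. It is not the code in the patch: the helper names are 
hypothetical, and it only assumes that PARQUET_READ_STATISTICS can be set per 
query with a SET statement.

```python
# Hypothetical sketch: pair each query with both PARQUET_READ_STATISTICS values.
from itertools import product


def build_test_matrix(queries, option_values=("true", "false")):
    """Return one test case per (query, PARQUET_READ_STATISTICS value) pair."""
    return [
        {"query": q, "options": {"PARQUET_READ_STATISTICS": v}}
        for q, v in product(queries, option_values)
    ]


def render(case):
    """Render a case as its SET statements followed by the query text."""
    sets = ["SET {}={};".format(k, v) for k, v in sorted(case["options"].items())]
    return "\n".join(sets + [case["query"]])
```

Each query then runs once with page filtering available and once without it, 
so both code paths are exercised.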


http://gerrit.cloudera.org:8080/#/c/19328/2/testdata/data/customer_nested_multiblock_multipage.parquet
File testdata/data/customer_nested_multiblock_multipage.parquet:

http://gerrit.cloudera.org:8080/#/c/19328/2/testdata/data/customer_nested_multiblock_multipage.parquet@1
PS2, Line 1: PAR1 (binary Parquet content omitted)
> +1
In this case it's also not too hard to create the table on the fly via Hive. 
The file positions shouldn't change, as we are writing the rows in order. One 
potential problem is that Hive might start ignoring the settings, in which case 
we wouldn't really test what we intend to test. But we can guard against that 
with some checks, e.g. that only one Parquet file is written, with 3 row groups 
and N pages.
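
A minimal sketch of such a check follows. It assumes the layout (files, row 
groups, page counts) has already been extracted from the Parquet footer by some 
other tool. The function name, the input shape and the min_pages parameter 
(standing in for "N pages") are all hypothetical.

```python
# Hypothetical sketch: nothing here reads real files; 'files' is assumed to be
# layout metadata extracted from the Parquet footers by another tool.

def check_layout(files, expected_row_groups=3, min_pages=2):
    """Validate the layout and return each row group's first file position.

    files maps a file name to a list of row groups. Each row group is a dict
    with 'num_rows' (int) and 'pages_per_column' (list of page counts).
    """
    if len(files) != 1:
        raise AssertionError("expected one Parquet file, got %d" % len(files))
    row_groups = next(iter(files.values()))
    if len(row_groups) != expected_row_groups:
        raise AssertionError(
            "expected %d row groups, got %d"
            % (expected_row_groups, len(row_groups)))
    first_positions = []
    next_position = 0
    for rg in row_groups:
        if min(rg["pages_per_column"]) < min_pages:
            raise AssertionError(
                "a column chunk has fewer than %d pages" % min_pages)
        # Rows are written in order, so positions continue across row groups.
        first_positions.append(next_position)
        next_position += rg["num_rows"]
    return first_positions
```

If Hive ever stopped honoring the settings, a check like this would fail 
instead of letting the test silently run against a single row group.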

I also like the idea of using a storage service for such files, but it would 
require us to reach a consensus and do some additional work. How about opening 
a Jira for that, so future data files could be stored outside of the Impala 
repo?



--
To view, visit http://gerrit.cloudera.org:8080/19328
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Gerrit-Change-Number: 19328
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Anonymous Coward <lipeng...@sensorsdata.cn>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gergely Fürnstáhl <gfurnst...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tma...@apache.org>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Dec 2022 12:59:09 +0000
Gerrit-HasComments: Yes
