Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17026 )
Change subject: IMPALA-10640: Support reading Parquet Bloom filters - most common types ...................................................................... Patch Set 24: (9 comments) http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17026/24//COMMIT_MSG@7 PS24, Line 7: IMPALA-10640: Support reading Parquet Bloom filters - most common types Do we filter fields in complex types as well, e.g. elements of an array? Or only top-level columns? http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@887 PS24, Line 887: Status bloom_filter_status = ProcessBloomFilter(row_group, We have query options to disable row group/page index filtering. Probably we should add a query option for bloom filtering as well, so in case there's a bug in the code the users will be able to disable it. http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1483 PS24, Line 1483: ReadToBuffer nit: thanks for adding this member function. Can we use it at ParquetPageIndex::ReadAll() as well? http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1497 PS24, Line 1497: cache_options)); nit: fits previous line http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1502 PS24, Line 1502: nit: +2 indent http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@1716 PS24, Line 1716: if (!bloom_filter.Find(hash)) { nit: could you please add some logging at VLOG(3)? E.g. which conjunct filtered which row group in which file? http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h File be/src/exec/parquet/parquet-bloom-filter-util.h: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/exec/parquet/parquet-bloom-filter-util.h@37 PS24, Line 37: May or may not use nit: could you add some details please? When storage is used, and when it isn't? And when is 'ptr' used? http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc File be/src/kudu/util/block_bloom_filter.cc: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/kudu/util/block_bloom_filter.cc@135 PS24, Line 135: nit: too much indentation http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h File be/src/thirdparty/xxhash/xxhash.h: http://gerrit.cloudera.org:8080/#/c/17026/24/be/src/thirdparty/xxhash/xxhash.h@70 PS24, Line 70: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735 > line too long (112 > 90) Probably you could add extend EXCLUDE_FILE_PATTERNS to thirdparty files in bin/jenkins/critique-gerrit-review.py. -- To view, visit http://gerrit.cloudera.org:8080/17026 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287 Gerrit-Change-Number: 17026 Gerrit-PatchSet: 24 Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 07 Apr 2021 13:29:30 +0000 Gerrit-HasComments: Yes