Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17262 )
Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most common types ...................................................................... Patch Set 17: (11 comments) http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG@35 PS16, Line 35: 'testdata/data/parquet-bloom-filtering.parquet' and checks whether the nit: wrap at 72 http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h File be/src/exec/hdfs-table-sink.h: http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h@284 PS16, Line 284: nit: extra row http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc File be/src/exec/parquet/hdfs-parquet-table-writer.cc: http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc@632 PS16, Line 632: /// Returns whether we are using Parquet Bloom filtering. It can be called in two cases: : /// when deciding whether to initialise 'parquet_bloom_filter_' and when deciding : /// whether it should be updated (a new value is to be inserted). In the first case, : /// 'init' should be set to true, otherwise to false. optional: using an enum to represent the Bloom filter state machine seems easier to understand to me. It could have states like uninitialized/enabled/disabled/wait_for_fallback_from_dictionary_encoding and make parquet_bloom_filter_tbl_prop_enabled_ unnecessary. http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java File fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java: http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java@37 PS16, Line 37: exp.put("col1", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES); : exp.put("col2", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES); : exp.put("col3", HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES); optional: it could be a bit nicer to fill the map with values in one command. We are already using Gueva's ImmutableMap.of(...), see https://github.com/apache/impala/blob/bb3062197b134f33e2796fac603e3367ab8bef1a/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java#L299 see https://stackoverflow.com/questions/6802483/how-to-directly-initialize-a-hashmap-in-a-literal-way for more details http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py File tests/query_test/test_parquet_bloom_filter.py: http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@113 PS16, Line 113: reference_col_to_bloom_filter, col_to_bloom_filter) can you add a select query and check the profile similarly to test_fallback_from_dict? This is probably redundant as the original Parquet file has the exaxt same bloom filters, but it would still worth to verify that we can use the bloom filters written by Impala. http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@196 PS16, Line 196: db=unique_database, tbl=tbl_name, col_name=column_name, size=bitset_size) nit: +2 indentation http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225 PS16, Line 225: _create_empty_test_database naming: This creates a table and not a database. http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225 PS16, Line 225: unique_database naming: here and at other helper functions: I would prefer db_name to unique_database, as unique_database has a specific meaning for test functions http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@240 PS16, Line 240: _populate_database naming: this populates a table, not a database http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@320 PS16, Line 320: with open("exp_dir.bin", "wb") as out: : out.write(exp_directory) : with open("dir.bin", "wb") as out: : out.write(directory) I don't get it why do we need to write these to files - what is the type of exp_directory-directory? If they are byte arrays, can't we simply compare them? If printing non-utf8 bytes is a problem, than we could compare them and print there hex encoded versions is there is a problem http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@325 PS16, Line 325: column index bloom filter? -- To view, visit http://gerrit.cloudera.org:8080/17262 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792 Gerrit-Change-Number: 17262 Gerrit-PatchSet: 17 Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 07 Jul 2021 11:28:00 +0000 Gerrit-HasComments: Yes