[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Csaba Ringhofer (Code Review) Wed, 07 Jul 2021 04:28:09 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17262 )


Change subject: IMPALA-10642: Write support for Parquet Bloom filters - most 
common types
......................................................................


Patch Set 17:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17262/16//COMMIT_MSG@35
PS16, Line 35:     'testdata/data/parquet-bloom-filtering.parquet' and checks 
whether the
nit: wrap at 72


http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/hdfs-table-sink.h@284
PS16, Line 284:
nit: extra row


http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17262/16/be/src/exec/parquet/hdfs-parquet-table-writer.cc@632
PS16, Line 632:   /// Returns whether we are using Parquet Bloom filtering. It 
can be called in two cases:
              :   /// when deciding whether to initialise 
'parquet_bloom_filter_' and when deciding
              :   /// whether it should be updated (a new value is to be 
inserted). In the first case,
              :   /// 'init' should be set to true, otherwise to false.
optional: using an enum to represent the Bloom filter state machine seems 
easier to understand to me. It could have states like 
uninitialized/enabled/disabled/wait_for_fallback_from_dictionary_encoding and 
make parquet_bloom_filter_tbl_prop_enabled_ unnecessary.


http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java
File 
fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java:

http://gerrit.cloudera.org:8080/#/c/17262/16/fe/src/test/java/org/apache/impala/planner/ParquetBloomFilterTblPropParserTest.java@37
PS16, Line 37:     exp.put("col1", 
HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
             :     exp.put("col2", 
HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
             :     exp.put("col3", 
HdfsTableSink.PARQUET_BLOOM_FILTER_MAX_BYTES);
optional: it could be a bit nicer to fill the map with values in one command.
We are already using Gueva's ImmutableMap.of(...), see 
https://github.com/apache/impala/blob/bb3062197b134f33e2796fac603e3367ab8bef1a/fe/src/main/java/org/apache/impala/catalog/local/DirectMetaProvider.java#L299

see 
https://stackoverflow.com/questions/6802483/how-to-directly-initialize-a-hashmap-in-a-literal-way
 for more details


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@113
PS16, Line 113:         reference_col_to_bloom_filter, col_to_bloom_filter)
can you add a select query and check the profile similarly to 
test_fallback_from_dict? This is probably redundant as the original Parquet 
file has the exaxt same bloom filters, but it would still worth to verify that 
we can use the bloom filters written by Impala.


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@196
PS16, Line 196:       db=unique_database, tbl=tbl_name, col_name=column_name, 
size=bitset_size)
nit: +2 indentation


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225
PS16, Line 225: _create_empty_test_database
naming: This creates a table and not a database.


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@225
PS16, Line 225: unique_database
naming: here and at other helper functions: I would prefer db_name to 
unique_database, as unique_database has a specific meaning for test functions


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@240
PS16, Line 240: _populate_database
naming: this populates a table, not a database


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@320
PS16, Line 320:           with open("exp_dir.bin", "wb") as out:
              :             out.write(exp_directory)
              :           with open("dir.bin", "wb") as out:
              :             out.write(directory)
I don't get it why do we need to write these to files - what is the type of 
exp_directory-directory? If they are byte arrays, can't we simply compare them? 
If printing non-utf8 bytes is a problem, than we could compare them and print 
there hex encoded versions is there is a problem


http://gerrit.cloudera.org:8080/#/c/17262/16/tests/query_test/test_parquet_bloom_filter.py@325
PS16, Line 325: column index
bloom filter?



--
To view, visit http://gerrit.cloudera.org:8080/17262
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie865efd4f0c11b9e111fb94f77d084bf6ee20792
Gerrit-Change-Number: 17262
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Amogh Margoor <amarg...@gmail.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <daniel.bec...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Tamas Mate <tm...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Jul 2021 11:28:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10642: Write support for Parquet Bloom filters - most common types

Reply via email to