[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Csaba Ringhofer (Code Review) Thu, 08 May 2025 03:47:40 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )


Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files
......................................................................


Patch Set 21:

(12 comments)

The implementations looks good (with the exception of a coding style comment), 
finally found time to process the tests.

http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG@7
PS21, Line 7: Sequence
What's the situation now when serialization.encoding is set for a Sequence file?


http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h
File be/src/util/char-codec.h:

http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h@54
PS21, Line 54: std::string& result
here and in the Handle* functions: Impala has the convention of using pointer 
in out parameters instead of non-const refs


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@59
PS21, Line 59:     assert int(count_utf8.get_data()) == 
int(count_encoded.get_data())
it would be nice to also pass the expected row count to the function to ensure 
that both tables are in the expected state - currently the tests could pass 
even if the tables are empty due to some error during table loading


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@68
PS21, Line 68: TestCharCodecGen
general notes on tests: the coverage seems great in general, but I am concerned 
that most tests would pass even if the encoding was ignore both during reading 
and writing
- a test would be nice where we check if Hive reads correctly an encoded table 
written by Impala
- a test would be nice where a pre-created file (e.g. something minimal added 
in testdata/data) is read and the result is checked with and without the 
encoding property

Also, encoding error code paths are not tested:
- a pre-created file could be tested with encoding error to test the error path 
in the reader
- a string could be written that is not valid utf8 - for example casting the 
binary col in functional.binary_tbl and writing it to a table with encoding set 
should result in an error


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72
PS21, Line 72:     cls.ImpalaTestMatrix.add_dimension(ImpalaTestDimension(
Did you check how much time does the test takes? If it is slow, then it would 
make sense to only run some encodings in exhaustive jobs.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@74
PS21, Line 74:     # There is no reason to run these tests using all dimensions.
A ticket could be mentioned that deals with supporting encoding for other file 
formats.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@79
PS21, Line 79:   
####################################################################
what is the reason for this line? some comment could be added like "Basic tests"


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@109
PS21, Line 109:     self.execute_query("""REFRESH {}.{}""".format(db, tbl_name))
a comment like '# remove REFRESH when IMPALA-13749 is fixed" would be useful to 
clean these up later


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@127
PS21, Line 127:   def test_enc_dec_gen(self, vector, unique_database):
Some basic function comment would be nice, e.g. """Write encoded table with 
Impala and read it back""". It took me some time to understand which read/write 
paths in Hive/Impala are covered by different tests.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@134
PS21, Line 134:     [os.remove(file_path) for file_path in file_paths]
I think that it would be better to create a temporary directory that is cleaned 
up automatically - the current cleanup code would not run in case there is an 
exception and could leave testdata dir in a dirty state.

See https://docs.python.org/3/library/tempfile.html / "import tempfile" in the 
code on how to do this.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@195
PS21, Line 195: "Snappy", "ZStandard"]
The reason for choosing the codecs is that Snappy doesn't support streaming 
while ZStandard does, right? This could be mentioned in a comment.


http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@303
PS21, Line 303:       self.filesystem_client.make_dir(part_dir)
looks duplicated (also see my comment at line 134)



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 21
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 08 May 2025 10:37:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10319: Support arbitrary encodings on Text/Sequence files

Reply via email to