Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: IMPALA-10319: Support arbitrary encodings on Text/Sequence files ...................................................................... Patch Set 21: (12 comments) The implementations looks good (with the exception of a coding style comment), finally found time to process the tests. http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22049/21//COMMIT_MSG@7 PS21, Line 7: Sequence What's the situation now when serialization.encoding is set for a Sequence file? http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h File be/src/util/char-codec.h: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.h@54 PS21, Line 54: std::string& result here and in the Handle* functions: Impala has the convention of using pointer in out parameters instead of non-const refs http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@59 PS21, Line 59: assert int(count_utf8.get_data()) == int(count_encoded.get_data()) it would be nice to also pass the expected row count to the function to ensure that both tables are in the expected state - currently the tests could pass even if the tables are empty due to some error during table loading http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@68 PS21, Line 68: TestCharCodecGen general notes on tests: the coverage seems great in general, but I am concerned that most tests would pass even if the encoding was ignore both during reading and writing - a test would be nice where we check if Hive reads correctly an encoded table written by Impala - a test would be nice where a pre-created file (e.g. something minimal added in testdata/data) is read and the result is checked with and without the encoding property Also, encoding error code paths are not tested: - a pre-created file could be tested with encoding error to test the error path in the reader - a string could be written that is not valid utf8 - for example casting the binary col in functional.binary_tbl and writing it to a table with encoding set should result in an error http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72 PS21, Line 72: cls.ImpalaTestMatrix.add_dimension(ImpalaTestDimension( Did you check how much time does the test takes? If it is slow, then it would make sense to only run some encodings in exhaustive jobs. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@74 PS21, Line 74: # There is no reason to run these tests using all dimensions. A ticket could be mentioned that deals with supporting encoding for other file formats. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@79 PS21, Line 79: #################################################################### what is the reason for this line? some comment could be added like "Basic tests" http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@109 PS21, Line 109: self.execute_query("""REFRESH {}.{}""".format(db, tbl_name)) a comment like '# remove REFRESH when IMPALA-13749 is fixed" would be useful to clean these up later http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@127 PS21, Line 127: def test_enc_dec_gen(self, vector, unique_database): Some basic function comment would be nice, e.g. """Write encoded table with Impala and read it back""". It took me some time to understand which read/write paths in Hive/Impala are covered by different tests. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@134 PS21, Line 134: [os.remove(file_path) for file_path in file_paths] I think that it would be better to create a temporary directory that is cleaned up automatically - the current cleanup code would not run in case there is an exception and could leave testdata dir in a dirty state. See https://docs.python.org/3/library/tempfile.html / "import tempfile" in the code on how to do this. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@195 PS21, Line 195: "Snappy", "ZStandard"] The reason for choosing the codecs is that Snappy doesn't support streaming while ZStandard does, right? This could be mentioned in a comment. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@303 PS21, Line 303: self.filesystem_client.make_dir(part_dir) looks duplicated (also see my comment at line 134) -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 21 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 08 May 2025 10:37:42 +0000 Gerrit-HasComments: Yes
