Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: IMPALA-10319: Support arbitrary encodings on Text files ...................................................................... Patch Set 23: (10 comments) http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72 PS21, Line 72: class TestCharCodecGen(ImpalaTestSuite): > The whole thing takes ~18m on my machine. So I've left only a single gbk en It seems a bit slow, especially the 32 mins, we could check if there are some steps that could be faster or unneeded dimensions. http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@362 PS23, Line 362: nit: trailing white space http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@372 PS23, Line 372: data_file_path = os.path.join(os.environ['IMPALA_HOME'], "testdata", : "charcodec", datafile) : self.filesystem_client.copy_from_local(data_file_path, : self._get_table_location("{0}.{1}".format(db, tbl_name), vector)) : self.execute_query("""REFRESH {}.{}""".format(db, tbl_name)) there are a few function in file_utils.py that helps with such tasks, e.g. create_table_and_copy_files() http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@379 PS23, Line 379: @SkipIfFS.hive Does this actually need Hive? http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@386 PS23, Line 386: utf8_table = self.prepare_test_table(vector, db, enc + '_names_utf8', enc + '_names_utf8.txt', None) Too long line (why didn't jenkins warn about this?) http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@395 PS23, Line 395: result = self.client.execute("""select * from {}.{} except select * from {}.{} : union all select * from {}.{} except select * from {}.{}""" There could be a utility function like check_tables_equal that checks this (+ probably count(*) to verify that it is not empty) http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@400 PS23, Line 400: f Does this actually need Hive? http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@410 PS23, Line 410: @SkipIfFS.hive Does this actually need Hive? http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@417 PS23, Line 417: err = self.execute_query_expect_failure(self.client, """insert overwrite {}.{} nit: trailing white space http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@432 PS23, Line 432: result_hive = self.run_stmt_in_hive("""select name from {}.{}""".format(db, encoded_table)) : result_impala = self.client.execute("""select name from {}.{}""".format(db, utf8_table)) : result_hive_list = result_hive.strip().split('\n')[1:] : assert result_hive_list == result_impala.data This may work for tables with a single file, but generally we cannot make assumptions about the order of rows in the result if there is no ORDER BY in the query. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 23 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 27 May 2025 14:58:43 +0000 Gerrit-HasComments: Yes
