Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22049 )

Change subject: IMPALA-10319: Support arbitrary encodings on Text files
......................................................................


Patch Set 23:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72
PS21, Line 72: class TestCharCodecGen(ImpalaTestSuite):
> The whole thing takes ~18m on my machine. So I've left only a single gbk en
It seems a bit slow, especially the 32 mins, we could check if there are some 
steps that could be faster or unneeded dimensions.


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py
File tests/query_test/test_charcodec.py:

http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@362
PS23, Line 362:
nit: trailing white space


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@372
PS23, Line 372:     data_file_path = os.path.join(os.environ['IMPALA_HOME'], 
"testdata",
              :         "charcodec", datafile)
              :     self.filesystem_client.copy_from_local(data_file_path,
              :         self._get_table_location("{0}.{1}".format(db, 
tbl_name), vector))
              :     self.execute_query("""REFRESH {}.{}""".format(db, tbl_name))
there are a few function in file_utils.py that helps with such tasks, e.g. 
create_table_and_copy_files()


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@379
PS23, Line 379:   @SkipIfFS.hive
Does this actually need Hive?


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@386
PS23, Line 386:     utf8_table = self.prepare_test_table(vector, db, enc + 
'_names_utf8', enc + '_names_utf8.txt', None)
Too long line (why didn't jenkins warn about this?)


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@395
PS23, Line 395:     result = self.client.execute("""select * from {}.{} except 
select * from {}.{}
              :         union all select * from {}.{} except select * from 
{}.{}"""
There could be a utility function like check_tables_equal that checks this (+ 
probably count(*) to verify that it is not empty)


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@400
PS23, Line 400: f
Does this actually need Hive?


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@410
PS23, Line 410:   @SkipIfFS.hive
Does this actually need Hive?


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@417
PS23, Line 417:     err = self.execute_query_expect_failure(self.client, 
"""insert overwrite {}.{}
nit: trailing white space


http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@432
PS23, Line 432:     result_hive = self.run_stmt_in_hive("""select name from 
{}.{}""".format(db, encoded_table))
              :     result_impala = self.client.execute("""select name from 
{}.{}""".format(db, utf8_table))
              :     result_hive_list = result_hive.strip().split('\n')[1:]
              :     assert result_hive_list == result_impala.data
This may work for tables with a single file, but generally we cannot make 
assumptions about the order of rows in the result if there is no ORDER BY in 
the query.



--
To view, visit http://gerrit.cloudera.org:8080/22049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Gerrit-Change-Number: 22049
Gerrit-PatchSet: 23
Gerrit-Owner: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Mihaly Szjatinya <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 27 May 2025 14:58:43 +0000
Gerrit-HasComments: Yes

Reply via email to