Mihaly Szjatinya has posted comments on this change. ( http://gerrit.cloudera.org:8080/22049 )
Change subject: IMPALA-10319: Support arbitrary encodings on Text files ...................................................................... Patch Set 24: (12 comments) http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc File be/src/util/char-codec.cc: http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@61 PS21, Line 61: scoped_mem_track > I meant the variable name could follow the snake-case, as other variable na Oh, ok, why not. http://gerrit.cloudera.org:8080/#/c/22049/21/be/src/util/char-codec.cc@103 PS21, Line 103: prefix > partial_symbol can be used to build up the symbol from the new buffer, as i This would indeed save an extra vector and a few lines of code, but on the other side it would lower function's exception-safety guarantee from strong to basic, by: 1. chaning class-level 'partial_symbol_' in case of throw outside funcion. 2. more importantly, changing 'buf_start' in case of throw outside function, and thus invalidating it. This may be functionally ok with the current code outside the function but generally it's a good design to keep throwing functions with strong guarantee. Besides, it's easier to debug this way. We could probably still get rid of 'success' flag, but I would keep it for consistency with later 'HandleSuffix'. http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/21/tests/query_test/test_charcodec.py@72 PS21, Line 72: class TestCharCodecGen(ImpalaTestSuite): > It seems a bit slow, especially the 32 mins, we could check if there are so If it's not important to have 'disable_codegen' both True and False of exhaustive mode, I've constrained it out to cut time in half. We might also generate smaller tables. http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py File tests/query_test/test_charcodec.py: http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@362 PS23, Line 362: # There is no reason to run these tests using all dimensions. > nit: trailing white space ack http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@372 PS23, Line 372: self.execute_query("""CREATE TABLE IF NOT EXISTS {}.{} (name STRING) : STORED AS TEXTFILE""".format(db, tbl_name)) : if encoding: : self.execute_query("""ALTER TABLE {}.{} SET : SERDEPROPERTIES("serialization.encoding"="{}")""" > there are a few function in file_utils.py that helps with such tasks, e.g. I wasn't aware of that function, but unfortunately it copies the entire folder, while we have a folder with different files for each table. And more generally perhaps it would be more consistent with the above tests where we do this manually for different reasons. http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@379 PS23, Line 379: "charcod > Does this actually need Hive? This comes down simply to 'IS_HDFS' check and I thought many tests use this just as a basic sanity check, not necessarily only with hive involved. Isn't that the case? http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@386 PS23, Line 386: def test_precreated_files(self, vector, unique_database): > Too long line (why didn't jenkins warn about this?) Sorry for this and similar, just hadn't updated small fixes yet. http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@395 PS23, Line 395: vector, db, enc + '_names_none', enc + '_names.txt', None) : with pytest.raises(AssertionError) as exc_info: > There could be a utility function like check_tables_equal that checks this We can use '_compare_tables' from this file above. Good idea. http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@400 PS23, Line 400: t > Does this actually need Hive? Ack http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@410 PS23, Line 410: if enc not in ['gbk', 'shift_jis']: pytest.skip() > Does this actually need Hive? Ack http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@417 PS23, Line 417: @SkipIfFS.hive > nit: trailing white space Ack http://gerrit.cloudera.org:8080/#/c/22049/23/tests/query_test/test_charcodec.py@432 PS23, Line 432: """Write table with Impala and read it back with Hive.""" : db = unique_database : enc = vector.get_value('charset') : > This may work for tables with a single file, but generally we cannot make a Interestingly enough. after adding 'order by name' to both queries the comparison doesn't work anymore. It's probably due to ordering taking place before decoding, which likely deserves a separate discussion. In this test however we have tiny single files. Or if needed we can do an ordering in python here. -- To view, visit http://gerrit.cloudera.org:8080/22049 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65 Gerrit-Change-Number: 22049 Gerrit-PatchSet: 24 Gerrit-Owner: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Mihaly Szjatinya <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 28 May 2025 22:52:32 +0000 Gerrit-HasComments: Yes
