[ https://issues.apache.org/jira/browse/IMPALA-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259413#comment-17259413 ]
ASF subversion and git services commented on IMPALA-10416:
----------------------------------------------------------

Commit e7839c4530df7161240eac9852c87a4c37c53fd1 in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e7839c4 ]

IMPALA-10416: Add raw string mode for testfiles to verify non-ascii results

Currently, the result section of the testfile is required to use escaped
strings. Take the following result section as an example:
  ---- RESULTS
  'Alice\nBob'
  'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When it is
called on str vars, Python implicitly converts the input vars to unicode
type using the default encoding, ascii. This causes UnicodeDecodeError
when the str contains non-ascii bytes. To fix this, this patch explicitly
decodes the input str using 'utf-8' encoding.

After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
  ---- QUERY
  select "你好\n你好"
  ---- RESULTS
  '\u4f60\u597d\n\u4f60\u597d'
  ---- TYPES
  STRING
It's painful to manually translate the unicode characters.

This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
  ---- QUERY
  select "你好"
  ---- RESULTS: RAW_STRING
  '你好'
  ---- TYPES
  STRING
If the result contains special characters, it's recommended to use the
default string mode. If the only special characters are newline
characters, we can use RAW_STRING together with the existing MULTI_LINE
comment.
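
The escaping behaviour described above can be illustrated with a minimal Python 3 sketch. It is not the code from tests/common/test_result_verifier.py: Python 3 `bytes` stands in for the Python 2 `str` values the test framework receives, the helper names `escape_actual`, `implicit_ascii_escape` and `matches` are hypothetical, and the surrounding quotes of the testfile format are ignored.

```python
def escape_actual(raw):
    """Escape one actual result value for the default (escaped) string mode.

    `raw` is assumed to hold the UTF-8 bytes returned by the query.
    """
    # Decode explicitly as UTF-8 instead of relying on an implicit ASCII decode.
    text = raw.decode("utf-8")
    # unicode_escape rewrites newlines as \n, backslashes as \\ and
    # non-ASCII code points as \xNN or \uXXXX sequences; the output is pure ASCII.
    return text.encode("unicode_escape").decode("ascii")


def implicit_ascii_escape(raw):
    """Mimic the pre-patch behaviour, where the decode step used ASCII."""
    return raw.decode("ascii").encode("unicode_escape").decode("ascii")


def matches(expected, raw, raw_string=False):
    """Compare an expected value with actual bytes (simplified assumption).

    In raw string mode the actual value is only decoded, never escaped.
    """
    actual = raw.decode("utf-8") if raw_string else escape_actual(raw)
    return expected == actual


if __name__ == "__main__":
    greeting = "你好\n你好".encode("utf-8")
    print(escape_actual(greeting))          # \u4f60\u597d\n\u4f60\u597d
    print(escape_actual(b"Alice\nBob"))     # Alice\nBob
    print(escape_actual(b"Alice\\nBob"))    # Alice\\nBob
    print(matches("你好", "你好".encode("utf-8"), raw_string=True))  # True
    try:
        implicit_ascii_escape("你好".encode("utf-8"))
    except UnicodeDecodeError as err:
        print("pre-patch style failure:", err)
```

In this sketch the only difference between the two modes is whether the actual value is escaped before comparison, which is why an expected value such as '你好' can be written verbatim under RAW_STRING.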

This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). However, pytest works if the compared values are both of
unicode type, so we explicitly convert the actual and expected str
values to unicode type.

Test:
 - Add tests in special-strings.test for raw string mode and the escaped
   string mode (default).
 - Run test_exprs.py::TestExprs::test_special_strings locally.

Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Testfile can't deal with non-ascii results
> ------------------------------------------
>
>                 Key: IMPALA-10416
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10416
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Blocker
>
> In the testfile, we can use non-ascii characters in the query. But when the
> result contains non-ascii characters, the test framework fails to deal with
> them.
> For instance, I'm currently on master branch (commit 5baadd1).
> Add a simple test query:
> {code:java}
> diff --git a/testdata/workloads/functional-query/queries/QueryTest/special-strings.test b/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> index 99a694c..9dfbc97 100644
> --- a/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> +++ b/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> @@ -24,3 +24,10 @@ select "'"
>  ---- TYPES
>  STRING
>  ====
> +---- QUERY
> +select "你好"
> +---- RESULTS
> +'你好'
> +---- TYPES
> +STRING
> +====
> {code}
> Run the test
> {code:java}
> impala-py.test tests/query_test/test_exprs.py::TestExprs::test_special_strings
> {code}
> The failure occurs:
> {code:java}
> TestExprs.test_special_strings[protocol: beeswax | exec_option: {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 'disable_codegen': False, 'abort_on_error': 1, 'exec_single_node_rows_threshold': 0} | table_format: text/none | enable_expr_rewrites: 1]
> tests/query_test/test_exprs.py:71: in test_special_strings
>     self.run_test_case('QueryTest/special-strings', vector)
> tests/common/impala_test_suite.py:693: in run_test_case
>     self.__verify_results_and_errors(vector, test_section, result, use_db)
> tests/common/impala_test_suite.py:529: in __verify_results_and_errors
>     replace_filenames_with_placeholder)
> tests/common/test_result_verifier.py:452: in verify_raw_results
>     actual = QueryTestResult(parse_result_rows(exec_result), actual_types,
> tests/common/test_result_verifier.py:493: in parse_result_rows
>     col = cols[i].encode('unicode_escape')
> E   UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org