[jira] [Commented] (IMPALA-10416) Testfile can't deal with non-ascii results

ASF subversion and git services (Jira) Tue, 05 Jan 2021 20:48:07 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-10416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259413#comment-17259413
 ]


ASF subversion and git services commented on IMPALA-10416:
----------------------------------------------------------

Commit e7839c4530df7161240eac9852c87a4c37c53fd1 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e7839c4 ]

IMPALA-10416: Add raw string mode for testfiles to verify non-ascii results

Currently, the result section of the testfile is required to use
escaped strings. Take the following result section as an example:
  --- RESULTS
  'Alice\nBob'
  'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When calling it
on str vars, Python will implicitly convert the input vars to unicode
type. The default encoding, ascii, is used. This causes
UnicodeDecodeError when the str contains non-ascii bytes. To fix this,
this patch explicitly decodes the input str using 'utf-8' encoding.
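
The escaping step described above can be sketched as follows. This is a
minimal Python 3 illustration, not the patch itself: the original code is
Python 2, where str holds bytes, so here a bytes value stands in for the
Python 2 str. The function name escape_result is hypothetical.

```python
def escape_result(raw_bytes):
    """Return the unicode-escaped text form of a UTF-8 encoded result value.

    Hypothetical helper: `raw_bytes` plays the role of a Python 2 `str`.
    """
    # Decode explicitly as UTF-8 instead of relying on an implicit ASCII decode,
    # which is what raised UnicodeDecodeError for non-ascii bytes.
    text = raw_bytes.decode('utf-8')
    # unicode_escape rewrites newlines as \n, backslashes as \\ and
    # non-ascii characters as \uXXXX, producing pure-ASCII bytes.
    return text.encode('unicode_escape').decode('ascii')


print(escape_result('Alice\nBob'.encode('utf-8')))
print(escape_result('你好\n你好'.encode('utf-8')))
```

With this in place, the escaped actual value can be compared directly with an
escaped expected line from the RESULTS section.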

After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
  ---- QUERY
  select "你好\n你好"
  ---- RESULTS
  '\u4f60\u597d\n\u4f60\u597d'
  ---- TYPES
  STRING
It's painful to manually translate the unicode characters.

This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
  ---- QUERY
  select "你好"
  ---- RESULTS: RAW_STRING
  '你好'
  ---- TYPES
  STRING
If the result contains special characters, it's recommended to use the
default (escaped) string mode. If the only special characters are
newlines, we can use RAW_STRING together with the existing MULTI_LINE
comment.
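
The difference between the two modes can be sketched as below. This is an
illustrative Python 3 sketch under assumptions: the names normalize_actual
and matches, and the raw_string flag, are hypothetical and are not taken from
the test framework; a bytes value stands in for the Python 2 str result.

```python
def normalize_actual(raw_bytes, raw_string=False):
    """Convert an actual result value into the form the expected line uses.

    Hypothetical helper: `raw_string=True` models a RESULTS section marked
    with RAW_STRING; the default models the escaped string mode.
    """
    text = raw_bytes.decode('utf-8')
    if raw_string:
        # Raw mode: compare the decoded text as-is.
        return text
    # Default mode: compare the unicode-escaped form.
    return text.encode('unicode_escape').decode('ascii')


def matches(expected, raw_bytes, raw_string=False):
    """Return True if the expected line equals the normalized actual value."""
    return expected == normalize_actual(raw_bytes, raw_string)


actual = '你好'.encode('utf-8')
print(matches('你好', actual, raw_string=True))
print(matches('\\u4f60\\u597d', actual))
```

In this sketch, an expected line written with the literal characters only
matches in raw mode, while the \uXXXX form only matches in the default mode.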

This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). However, pytest works if the compared values are both
in unicode type. So we explicitly convert the actual and expected str
values to unicode type.
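
The conversion before comparison can be sketched like this. It is a Python 3
approximation of the idea only, with a hypothetical helper name to_text;
bytes stands in for the Python 2 str type and str for the Python 2 unicode
type.

```python
def to_text(value):
    """Return `value` as text, decoding UTF-8 bytes if necessary.

    Hypothetical helper, not the function used in the patch.
    """
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return value


expected = '你好'.encode('utf-8')   # byte string read from the testfile
actual = '你好'                     # already text
# Both sides are text at the point of the assertion, so a failure message
# can be rendered without an implicit ASCII decode.
assert to_text(expected) == to_text(actual)
```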

Test:
 - Add tests in special-strings.test for raw string mode and the escaped
   string mode (default).
 - Run test_exprs.py::TestExprs::test_special_strings locally.

Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Testfile can't deal with non-ascii results
> ------------------------------------------
>
>                 Key: IMPALA-10416
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10416
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Blocker
>
> In the testfile, we can use non-ascii characters in the query. But when the 
> result contains non-ascii characters, the test framework fails to deal with 
> them.
> For instance, I'm currently on master branch (commit 5baadd1). Add a simple 
> test query:
> {code:java}
> diff --git a/testdata/workloads/functional-query/queries/QueryTest/special-strings.test b/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> index 99a694c..9dfbc97 100644
> --- a/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> +++ b/testdata/workloads/functional-query/queries/QueryTest/special-strings.test
> @@ -24,3 +24,10 @@ select "'"
>  ---- TYPES
>  STRING
>  ====
> +---- QUERY
> +select "你好"
> +---- RESULTS
> +'你好'
> +---- TYPES
> +STRING
> +==== {code}
> Run the test
> {code:java}
> impala-py.test tests/query_test/test_exprs.py::TestExprs::test_special_strings {code}
> The failure occurs:
> {code:java}
> TestExprs.test_special_strings[protocol: beeswax | exec_option: 
> {'batch_size': 0, 'num_nodes': 0, 'disable_codegen_rows_threshold': 0, 
> 'disable_codegen': False, 'abort_on_error': 1, 
> 'exec_single_node_rows_threshold': 0} | table_format: text/none | 
> enable_expr_rewrites: 1] 
> tests/query_test/test_exprs.py:71: in test_special_strings
>     self.run_test_case('QueryTest/special-strings', vector)
> tests/common/impala_test_suite.py:693: in run_test_case
>     self.__verify_results_and_errors(vector, test_section, result, use_db)
> tests/common/impala_test_suite.py:529: in __verify_results_and_errors
>     replace_filenames_with_placeholder)
> tests/common/test_result_verifier.py:452: in verify_raw_results
>     actual = QueryTestResult(parse_result_rows(exec_result), actual_types,
> tests/common/test_result_verifier.py:493: in parse_result_rows
>     col = cols[i].encode('unicode_escape')
> E   UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: 
> ordinal not in range(128) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
