[jira] [Commented] (IMPALA-11499) Refactor UrlEncode function to handle special characters
[ https://issues.apache.org/jira/browse/IMPALA-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845907#comment-17845907 ]

ASF subversion and git services commented on IMPALA-11499:
----------------------------------------------------------

Commit b8a66b0e104f8e25e70fce0326d36c9b48672dbb in impala's branch refs/heads/branch-4.4.0 from pranavyl
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b8a66b0e1 ]

IMPALA-11499: Refactor UrlEncode function to handle special characters

An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'.

The patch makes three key changes:
1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead.
2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding.
3. The list of characters to be escaped is extended to match the current list in Hive.

Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py.
Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Reviewed-on: http://gerrit.cloudera.org:8080/21131
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
(cherry picked from commit 85cd07a11e876f3d8773f2638f699c61a6b0dd4c)

> Refactor UrlEncode function to handle special characters
>
> Key: IMPALA-11499
> URL: https://issues.apache.org/jira/browse/IMPALA-11499
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Pranav Yogi Lodha
> Priority: Critical
> Fix For: Impala 4.5.0
>
> Partition values are incorrectly URL-encoded in backend for unicode characters, e.g. '运营业务数据' is encoded to '�%FFBF�营业务数据' which is wrong.
> To reproduce the issue, first create a partition table:
> {code:sql}
> create table my_part_tbl (id int) partitioned by (p string) stored as parquet;
> {code}
> Then insert data into it using partition values containing '运'. They will fail:
> {noformat}
> [localhost:21050] default> insert into my_part_tbl partition(p='运营业务数据') values (0);
> Query: insert into my_part_tbl partition(p='运营业务数据') values (0)
> Query submitted at: 2022-08-16 10:03:56 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=404ac3027c4b7169:39d16a2d
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d/.404ac3027c4b7169-39d16a2d_1475855322_dir/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d/.404ac3027c4b7169-39d16a2d_1475855322_dir/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq
> Error(5): Input/output error
> [localhost:21050] default> insert into my_part_tbl partition(p='运') values (0);
> Query: insert into my_part_tbl partition(p='运') values (0)
> Query submitted at: 2022-08-16 10:04:22 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=a64e5883473ec28d:86e7e335
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e335/.a64e5883473ec28d-86e7e335_1582623091_dir/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e335/.a64e5883473ec28d-86e7e335_1582623091_dir/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq
[jira] [Commented] (IMPALA-11499) Refactor UrlEncode function to handle special characters
[ https://issues.apache.org/jira/browse/IMPALA-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845052#comment-17845052 ]

ASF subversion and git services commented on IMPALA-11499:
----------------------------------------------------------

Commit 85cd07a11e876f3d8773f2638f699c61a6b0dd4c in impala's branch refs/heads/master from pranavyl
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=85cd07a11 ]

IMPALA-11499: Refactor UrlEncode function to handle special characters

An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'.

The patch makes three key changes:
1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead.
2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding.
3. The list of characters to be escaped is extended to match the current list in Hive.

Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py.
Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb
Reviewed-on: http://gerrit.cloudera.org:8080/21131
Reviewed-by: Impala Public Jenkins
Tested-by: Impala Public Jenkins
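
The three changes listed in the commit message above can be illustrated with a minimal sketch. This is not the Impala implementation: the names kSpecialChars and EncodeSpecialChars are hypothetical, and the character set is only the short list quoted in this thread with '\x7F' in place of "\u00FF", not the extended Hive list. Because every entry in the set is a single ASCII byte, no byte of a multi-byte UTF-8 sequence (all of which are 0x80 or above) can match it.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <unordered_set>

// Hypothetical subset of characters to escape (assumption for illustration);
// the real list in Impala and Hive is longer and is not reproduced here.
static const std::unordered_set<char> kSpecialChars = {
    '"', '#', '%', '\\', '*', '/', ':', '=', '?', '\x7F'};

// Percent-encodes only the bytes found in kSpecialChars. Every other byte,
// including the bytes of multi-byte UTF-8 sequences, is copied through.
std::string EncodeSpecialChars(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  for (char c : in) {
    if (kSpecialChars.count(c) > 0) {
      char buf[4];  // '%' + two hex digits + terminating NUL
      // Cast to unsigned char so the value is never sign-extended.
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned char>(c));
      out += buf;
    } else {
      out += c;
    }
  }
  return out;
}
```

With this sketch, a date-like value such as "2022/08/16" would have its slashes escaped, while the three UTF-8 bytes of '运' would pass through unchanged.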
[jira] [Commented] (IMPALA-11499) Refactor UrlEncode function to handle special characters
[ https://issues.apache.org/jira/browse/IMPALA-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841255#comment-17841255 ]

Quanlong Huang commented on IMPALA-11499:
-----------------------------------------

[~daniel.becker] found the root cause in the review: [https://gerrit.cloudera.org/c/21131/6/be/src/util/coding-util.cc#55]

The problem is in this string:
{code:cpp}
static function<bool (char)> HiveShouldEscape = is_any_of("\"#%\\*/:=?\u00FF");
{code}
"\u00FF" is the Unicode escape for ÿ, which is encoded into two bytes in UTF-8: 0xc3 *0xbf*. "运" is encoded into three bytes in UTF-8: 0xe8 *0xbf* 0x90. The second byte *0xbf* is in the set, so it is encoded as "%FFBF" while the other bytes remain unchanged. That's the problem.

We can find more common Chinese characters that could hit this, e.g.
* 近: 0xe8 0xbf 0x91
* 返: 0xe8 0xbf 0x94
* 还: 0xe8 0xbf 0x98
* 这: 0xe8 0xbf 0x99
* 进: 0xe8 0xbf 0x9b
* 远: 0xe8 0xbf 0x9c
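
The byte collision described in this comment can be reproduced with a small sketch. It assumes that membership is tested one byte at a time against the characters of the escape string, and that "\u00FF" in a narrow string literal becomes the UTF-8 bytes 0xc3 0xbf. The helper AnyByteMatches and the constant names are hypothetical, and the bytes are written out explicitly so the example does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <string>

// Returns true if any byte of `value` also occurs in `escape_chars`.
// This mimics byte-wise membership testing against a string of characters.
bool AnyByteMatches(const std::string& value, const std::string& escape_chars) {
  for (char c : value) {
    if (escape_chars.find(c) != std::string::npos) return true;
  }
  return false;
}

// Old set: ends with "\u00FF" spelled as its UTF-8 bytes 0xc3 0xbf.
const std::string kOldSet = "\"#%\\*/:=?\xC3\xBF";
// New set: '\x7F' (DELETE) instead, so every entry is a single ASCII byte.
const std::string kNewSet = "\"#%\\*/:=?\x7F";

const std::string kYun = "\xE8\xBF\x90";   // '运' in UTF-8
const std::string kYing = "\xE8\x90\xA5";  // '营' in UTF-8
```

Under these assumptions, AnyByteMatches(kYun, kOldSet) is true because of the shared 0xbf byte, while kYing has no byte in common with the old set, which is consistent with '营业务数据' inserting successfully. Neither value matches kNewSet.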
[jira] [Commented] (IMPALA-11499) Refactor UrlEncode function to handle special characters
[ https://issues.apache.org/jira/browse/IMPALA-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834287#comment-17834287 ]

Zoltán Borók-Nagy commented on IMPALA-11499:
--------------------------------------------

When fixing this, could you please make sure Iceberg tables also work well?

> Refactor UrlEncode function to handle special characters
>
> Key: IMPALA-11499
> URL: https://issues.apache.org/jira/browse/IMPALA-11499
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Pranav Yogi Lodha
> Priority: Critical
>
> Partition values are incorrectly URL-encoded in backend for unicode characters, e.g. '运营业务数据' is encoded to '�%FFBF�营业务数据' which is wrong.
> To reproduce the issue, first create a partition table:
> {code:sql}
> create table my_part_tbl (id int) partitioned by (p string) stored as parquet;
> {code}
> Then insert data into it using partition values containing '运'. They will fail:
> {noformat}
> [localhost:21050] default> insert into my_part_tbl partition(p='运营业务数据') values (0);
> Query: insert into my_part_tbl partition(p='运营业务数据') values (0)
> Query submitted at: 2022-08-16 10:03:56 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=404ac3027c4b7169:39d16a2d
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d/.404ac3027c4b7169-39d16a2d_1475855322_dir/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/404ac3027c4b7169_39d16a2d/.404ac3027c4b7169-39d16a2d_1475855322_dir/p=�%FFBF�营业务数据/404ac3027c4b7169-39d16a2d_1585092794_data.0.parq
> Error(5): Input/output error
> [localhost:21050] default> insert into my_part_tbl partition(p='运') values (0);
> Query: insert into my_part_tbl partition(p='运') values (0)
> Query submitted at: 2022-08-16 10:04:22 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=a64e5883473ec28d:86e7e335
> ERROR: Error(s) moving partition files. First error (of 1) was: Hdfs op (RENAME hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e335/.a64e5883473ec28d-86e7e335_1582623091_dir/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq TO hdfs://localhost:20500/test-warehouse/my_part_tbl/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq) failed, error was: hdfs://localhost:20500/test-warehouse/my_part_tbl/_impala_insert_staging/a64e5883473ec28d_86e7e335/.a64e5883473ec28d-86e7e335_1582623091_dir/p=�%FFBF�/a64e5883473ec28d-86e7e335_163454510_data.0.parq
> Error(5): Input/output error
> {noformat}
> However, partition value without the character '运' is OK:
> {noformat}
> [localhost:21050] default> insert into my_part_tbl partition(p='营业务数据') values (0);
> Query: insert into my_part_tbl partition(p='营业务数据') values (0)
> Query submitted at: 2022-08-16 10:04:13 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
> Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=b04894bfcfc3836a:b1ac9036
> Modified 1 row(s) in 0.21s
> {noformat}
> Hive is able to execute all these statements.
> I'm able to narrow down the issue into Backend, where we URL-encode the partition value in HdfsTableSink::InitOutputPartition():
> {code:cpp}
>       string value_str;
>       partition_key_expr_evals_[j]->PrintValue(value, &value_str);
>       // Directory names containing partition-key values need to be UrlEncoded, in
>       // particular to avoid problems when '/' is part of the key value (which might
>       // occur, for example, with date strings). Hive will URL decode the value
>       // transparently when Impala's frontend asks the metastore for partition key values,
>       // which makes it particularly important that we use the same encoding as Hive. It's
>       // also not necessary to encode the values when writing partition metadata. You can
>       // check this with 'show partitions <tbl>' in Hive, followed by a select from a
>       // decoded partition key value.
>       string encoded_str;
>       UrlEncode(value_str, &encoded_str, true);
>       string part_key_value = (encoded_str.empty() ?