[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Reviewed-on: http://gerrit.cloudera.org:8080/21131 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 152 insertions(+), 20 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 29 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 28: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 28 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 09 May 2024 15:09:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 27: Code-Review+2 Thanks Pranav! -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 09 May 2024 10:06:59 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 28: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 28 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 09 May 2024 10:07:20 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 28: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10624/ DRY_RUN=false -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 28 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 09 May 2024 10:07:21 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 27: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 21:48:13 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 27: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16140/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 18:39:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 27: Verified+1 Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 18:15:52 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 27: (1 comment) > Uploaded patch set 27. http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py@343 PS26, Line 343: > This. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 18:12:34 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#27). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 152 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/27 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 27 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py@343 PS26, Line 343: = > flake8: E225 missing whitespace around operator This. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 18:01:35 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 26: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 15:12:37 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 26: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10619/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 08 May 2024 10:43:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 26: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16122/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 21:16:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#26). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 152 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/26 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 26: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/26/tests/query_test/test_insert.py@343 PS26, Line 343: = flake8: E225 missing whitespace around operator -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 26 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 20:51:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 25: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 25 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 20:34:57 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 25: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16120/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 25 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 20:21:09 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 24: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/16119/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 24 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 20:18:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 25: (5 comments) > Uploaded patch set 25. http://gerrit.cloudera.org:8080/#/c/21131/23/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/23/be/src/util/coding-util-test.cc@94 PS23, Line 94: string test_path = "/home/impala/directory/"; > These strings could also be extracted into variables. Done http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@320 PS23, Line 320: is > "is a" Done http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@320 PS23, Line 320: the result > "the result line..." Done http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@321 PS23, Line 321: we only verify that the value is present in str > "we only verify that the value is present in string representation of the w Done http://gerrit.cloudera.org:8080/#/c/21131/24/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/24/tests/query_test/test_insert.py@301 PS24, Line 301: "create table {}(i int) partitioned by (p string) stored as parquet".format(tbl)) I've added a new test for iceberg, since there are a lot of differences in how we create, insert and how show partition works in iceberg, I've skipped the helper function to improve the readabilty. The only thing that was left common were a few asserts and the strings. I also think that if we're to add any new tests for iceberg, it'd be easier and more convenient if it'll have a separate function. I've an optimized version for the same as well and have shared the same with you, but I think it just gets more complicated when we'll need to handle iceberg and parquet tables separately because of their show partition evaluation. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 25 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 19:56:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 141 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/25 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 25 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#24). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 141 insertions(+), 20 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/24 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 24 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@301 PS23, Line 301: "create table {}(i int) partitioned by (p string) stored as parquet".format(tbl)) One more thing: you should run this test with an Iceberg table too. You could extract the common parts (mostly everything except for the create and insert statements) into a helper function. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 14:19:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@320 PS23, Line 320: result line "the result line..." http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@321 PS23, Line 321: the values present inside it are just verified. "we only verify that the value is present in string representation of the whole row" -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 12:46:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: (2 comments) Just a few nits. http://gerrit.cloudera.org:8080/#/c/21131/23/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/23/be/src/util/coding-util-test.cc@94 PS23, Line 94: TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", false); These strings could also be extracted into variables. http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/23/tests/query_test/test_insert.py@320 PS23, Line 320: are "is a" -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 12:24:14 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16108/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 06:25:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: Code-Review+1 LGTM -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 06:23:27 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 23: (1 comment) > Uploaded patch set 23. http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py@310 PS22, Line 310: \ > We need double back slashes to fix this, i.e. "\\" Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 06:00:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#23). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 113 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/23 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 23 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 22: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16107/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 22 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 05:50:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 22: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py@310 PS22, Line 310: \ > flake8: W605 invalid escape sequence '\{' We need double back slashes to fix this, i.e. "\\" -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 22 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 05:33:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 22: (1 comment) > Uploaded patch set 22. http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@299 PS20, Line 299: tb > Not done yet. In PS21, it's still using 4 spaces. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 22 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 05:25:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#22). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 113 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/22 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 22 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 22: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/22/tests/query_test/test_insert.py@310 PS22, Line 310: \ flake8: W605 invalid escape sequence '\{' -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 22 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 05:25:28 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 21: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16106/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 05:03:40 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 21: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@299 PS20, Line 299: > Done Not done yet. In PS21, it's still using 4 spaces. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 04:49:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 21: (5 comments) > Patch Set 21: > > (11 comments) > > > Uploaded patch set 21. http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@96 PS20, Line 96: string test = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10" > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@301 PS21, Line 301: ) > flake8: E501 line too long (91 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@310 PS21, Line 310: \ > flake8: W605 invalid escape sequence '\{' Done http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@311 PS21, Line 311: \ > flake8: E501 line too long (91 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@320 PS21, Line 320: e > flake8: E501 line too long (94 > 90 characters) Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 04:39:54 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 21: (11 comments) > Uploaded patch set 21. http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc@97 PS19, Line 97: "\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x7F\"#%'" > Why is the single quote escaped? Done http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@98 PS20, Line 98: "*/:=?\\{[]^"; > Please add '^' Done http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@104 PS20, Line 104: } > nit: remove blank line Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@297 PS20, Line 297: def test_escaped_partition_values(self, unique_database): > nit: move this inside the method, i.e. Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@299 PS20, Line 299: > nit: use 2 spaces indention Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@300 PS20, Line 300: > wrap this to be Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@301 PS20, Line 301: ) > flake8: E501 line too long (104 > 90 characters) Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@302 PS20, Line 302: > Single quote is missing here. The two single quotes in the middle doesn't r Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@303 PS20, Line 303: > wrap this to be Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@305 PS20, Line 305: special_characters = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r" \ > This just verifies the partition value. We still need to verify the partiti Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@306 PS20, Line 306: "\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B" \ > Can you add a comment about why splitting this out was necessary? "" is used to represent a backslash as it gets escaped again during parsing of the insert statement. It is no more needed as we introduce a part_value variable to store the correct string. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 04:38:56 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 21: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@301 PS21, Line 301: ) flake8: E501 line too long (91 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@310 PS21, Line 310: \ flake8: W605 invalid escape sequence '\{' http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@311 PS21, Line 311: \ flake8: E501 line too long (91 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/21/tests/query_test/test_insert.py@320 PS21, Line 320: e flake8: E501 line too long (94 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 04:38:50 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#21). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test, coding-util-test.cc and test_insert.py. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 113 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/21 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 21 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 20: (8 comments) http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@98 PS20, Line 98: "*/:=?\\{[]"; Please add '^' http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@104 PS20, Line 104: nit: remove blank line http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@297 PS20, Line 297: """Test for special characters in partition values.""" nit: move this inside the method, i.e. def test_escaped_partition_values(self, unique_database): """Test for special characters in partition values.""" http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@299 PS20, Line 299: nit: use 2 spaces indention http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@300 PS20, Line 300: e > flake8: E501 line too long (106 > 90 characters) wrap this to be self.execute_query( "create table {}(i int) partitioned by (p string) stored as parquet".format(tbl)) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@302 PS20, Line 302: 7 > flake8: E501 line too long (108 > 90 characters) Single quote is missing here. The two single quotes in the middle doesn't represent a single quote. They become the end and start of two strings. We can double quote the string and wrap it as special_characters_ = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r" \ "\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B" \ "\x1C\x1D\x1E\x1F\"\x7F'%*/:=?\\{[]#^" Also add '^' and remove "()" http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@303 PS20, Line 303: _ > flake8: E501 line too long (104 > 90 characters) wrap this to be self.execute_query( "insert into {} partition(p='{}') values (0)".format(tbl, special_characters_)) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@305 PS20, Line 305: assert a.data[0] == special_characters_ This just verifies the partition value. We still need to verify the partition dir created by Impala is correct. This seems better: def test_escaped_partition_values(self, unique_database): """Test for special characters in partition values.""" tbl = unique_database + ".tbl" self.execute_query( "create table {}(i int) partitioned by (p string) stored as parquet".format(tbl)) # Use "\\'" for single quote since the insert statement use single quote on the # partition value. Use "" for back slash since it will be escaped again in # parsing the insert statement special_characters = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r" \ "\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B" \ "\x1C\x1D\x1E\x1F\"\x7F\\'%*/:=?{[]#^" part_value = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r" \ "\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B" \ "\x1C\x1D\x1E\x1F\"\x7F'%*/:=?\{[]#^" part_dir = "p=SpecialCharacters%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11" \ "%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%22%7F%27%25%2A%2F%3A%3D%3F" \ "%5C%7B%5B%5D%23%5E" res = self.execute_query( "insert into {} partition(p='{}') values (0)".format(tbl, special_characters)) assert res.data[0] == part_dir + ": 1" res = self.client.execute("select p from {}".format(tbl)) assert res.data[0] == part_value res = self.execute_query("show partitions " + tbl) # There are "\t" in the partition value so we can't easily split the result line. # Just verify the values are inside it. assert part_value in res.data[0] assert part_dir in res.data[0] -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 20 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 07 May 2024 03:26:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 20: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16102/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 20 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 21:46:01 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 19: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16101/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 19 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 21:44:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 20: Code-Review+1 (3 comments) http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc@97 PS19, Line 97: "\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x7F\"#%'" > Why is the single quote escaped? Done http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@302 PS19, Line 302: '\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\"\x7F''()%*/:=?{[]#') > This looks like you forgot to escape the double quote. The rest are omitted Done http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@306 PS20, Line 306: self.execute_query("insert into {} partition(p='') values (0)".format(tbl)) Can you add a comment about why splitting this out was necessary? -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 20 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 21:21:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#20). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 100 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/20 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 20 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#19). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test M tests/query_test/test_insert.py 5 files changed, 96 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/19 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 19 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 20: (5 comments) http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/20/be/src/util/coding-util-test.cc@96 PS20, Line 96: string test = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11" line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@300 PS20, Line 300: e flake8: E501 line too long (106 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@301 PS20, Line 301: E flake8: E501 line too long (104 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@302 PS20, Line 302: 7 flake8: E501 line too long (108 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/20/tests/query_test/test_insert.py@303 PS20, Line 303: _ flake8: E501 line too long (104 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 20 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 21:21:11 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 19: (7 comments) http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/19/be/src/util/coding-util-test.cc@96 PS19, Line 96: string test = "SpecialCharacters\x01\x02\x03\x04\x05\x06\x07\b\t\n\v\f\r\x0E\x0F\x10\x11" line too long (91 > 90) http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py File tests/query_test/test_insert.py: http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@300 PS19, Line 300: e flake8: E501 line too long (106 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@301 PS19, Line 301: E flake8: E501 line too long (104 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@302 PS19, Line 302: " flake8: E501 line too long (107 > 90 characters) http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@302 PS19, Line 302: # flake8: E261 at least two spaces before inline comment http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@302 PS19, Line 302: # flake8: E262 inline comment should start with '# ' http://gerrit.cloudera.org:8080/#/c/21131/19/tests/query_test/test_insert.py@303 PS19, Line 303: _ flake8: E501 line too long (104 > 90 characters) -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 19 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 21:19:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 18: (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/18/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/18/be/src/util/coding-util-test.cc@96 PS18, Line 96: TestUrl("Impala\x01""dev\x02""\tenv\v\x14\x12""for\x1D\x05\b\x03\x04" : "\x05\x06\x07\x09\x0C\r\x0E""beginners\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18" : "\x19\x10""Amuststart\x1B\x1C\x1D\x1E\x1F\x7F", : "Impala%01dev%02%09env%0b%14%12for%1d%05" : "%08%03%04%05%06%07%09%0c%0d%0ebeginners%0f%10%11%12%13%14%15%16%17%18%19%10Amust" : "start%1b%1c%1d%1e%1f%7f", false); : TestUrl("Impala\x01""dev\x02""\tenv\v\x14\x12""for\x1D\x05\b\x03\x04\x05\x06" : "\x07\x09\x0C\r\x0E""beginners\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19" : "\x10""Amuststart\x1B\x1C\x1D\x1E\x1F\x7F", : "Impala%01dev%02%09env%0b%14%12for%1d" : "%05%08%03%04%05%06%07%09%0c%0d%0ebeginners%0f%10%11%12%13%14%15%16%17%18%19%10" : "Amuststart%1b%1c%1d%1e%1f%7f", true); The tests look a bit messy for me. Not sure if all escaped characters are covered. Can we add a simple case that the input string just consists of all the escaped characters? Please extract the strings into variables to avoid writing them twice. If the following tests don't add more coverage, we can probably remove them. Just keep this one and add the simple case before this. If they do provide more coverage, please add some comments to explain it. http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66 PS15, Line 66: !isalnum(static_cast(ch)) && !ShouldNotEscape(ch))) { > Done. We can write the test in python codes instead of using test files, e.g. add this in tests/query_test/test_insert.py under class TestInsertPartKey: def test_escaped_partition_values(self, unique_database): tbl = unique_database + ".tbl" self.execute_query("create table {}(i int) partitioned by(p string)".format(tbl)) import struct # TODO: add more values for b in range(1, 32): stmt = "insert into {} partition(p='{}') values (0)".format(tbl, struct.pack("B", b)) self.execute_query(stmt) res = self.execute_query("show partitions " + tbl) # TODO: verify more values for i in range(8): columns = res.data[i].split("\t") # Verify the partition value assert columns[0] == struct.pack("B", i+1) # Verify the partition location assert "p=%0" + str(i+1) in columns[8] assert len(res.data) == 32 Note that we still use python2 in tests so we need 'struct.pack("B", b)' to create a byte string. In python3, we can use bytes directly. The above test is incomplete. Please resolve the TODOs. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 18 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 07:15:04 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 18: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16095/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 18 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 04:54:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 18: (2 comments) > Uploaded patch set 18. http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66 PS15, Line 66: !isalnum(static_cast(ch)) && !ShouldNotEscape(ch))) { > I played with it a bit and it doesn't work with for example '\t' and '\n' ( Done. Note: I tested those cases manually but to include those tests we'll have to bring a change in the test framework, which in its current state prevents us to write such tests. For example, '\t ' is considered a separator in our current test framework, so to incorporate such tests we might have to change the test framework which might lead to other errors, That's why I haven't included such cases in the current tests. http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66 PS15, Line 66: c > I think it would be good if we first cast it to unsigned char because of wh Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 18 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 06 May 2024 04:37:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 177 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/18 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 18 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 15: (2 comments) Thanks Pranav! In the meantime I discovered something else, but that shouldn't be too hard to fix. http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66 PS15, Line 66: ss << '%' << static_cast(ch); I played with it a bit and it doesn't work with for example '\t' and '\n' (for example insert into my_part_tbl partition(p='a\tt') values (0); ). Let's take '\t'. Its value is 9, which only has a single hexadecimal digit. Now we'll escape it as "%9" but Hive has "%09". To match Hive's behaviour, do this: 1. Add this include in the appropriate include group: #include 2. Add "<< setfill('0')" after "hex" on L58. 3. Add "<< setw(2)" after inserting the % sign on L66. This will pad hex values so that they are always at least 2 chars wide. Note that setw is reset after insertion, so it has to be within the loop. This also reveals a test deficiency: we should ideally test every character that is in 'SpecialCharacters'. I don't know if we can insert arbitrary byte values in Impala strings, maybe Quanlong can help there. But even if that's not possible, we should cover the ones for which an escape sequence exists (i.e. '\t', '\n', '\v' etc.). http://gerrit.cloudera.org:8080/#/c/21131/15/be/src/util/coding-util.cc@66 PS15, Line 66: ch I think it would be good if we first cast it to unsigned char because of what happens if we encounter a character > 127. This is possible if the input contains one, and we are in non-Hive-compat mode. If char is a signed type (that is implementation dependent), then the value > 127 will be interpreted as a negative number. When we convert it to uint32_t, it will be sign-extended. For example, instead of "%FF" we will have "%". If we convert it first to unsigned char, it will not be sign-extended. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 15 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 14:01:31 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 15: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16065/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 15 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 12:11:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 14: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16064/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 14 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 12:08:30 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 15: (1 comment) > Uploaded patch set 15: Commit message was updated. http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@18 PS13, Line 18: makes thr > I think that's inferred. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 15 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 11:45:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#15). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch makes three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 177 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/15 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 15 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 14: (4 comments) > Uploaded patch set 14: Commit message was updated. http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@18 PS13, Line 18: thr > I've almost forgotten the third: the list of characters to be escaped was e Done http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@18 PS13, Line 18: addresses > makes I think that's inferred. http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@19 PS13, Line 19: the set of charac > I didn't mean it as a literal text to be included. Instead, it could be: Done http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@25 PS13, Line 25: n > Nit: "... byte and whose inclusion ...". Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 14 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 11:44:25 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch addresses three key changes: 1. Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead. 2. '\xFF', which is an invalid UTF-8 byte and whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. 3. The list of characters to be escaped is extended to match the current list in Hive. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 177 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/14 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 14 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 13: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16063/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 13 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 11:38:56 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 13: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@18 PS13, Line 18: addresses makes http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@18 PS13, Line 18: two I've almost forgotten the third: the list of characters to be escaped was extended to match the current list in Hive. http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@19 PS13, Line 19: there was a "set" I didn't mean it as a literal text to be included. Instead, it could be: "Before the change, the set of characters that need to be escaped was stored as a string. The current patch uses an unordered_set instead." http://gerrit.cloudera.org:8080/#/c/21131/13//COMMIT_MSG@25 PS13, Line 25: Nit: "... byte and whose inclusion ...". -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 13 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 11:31:32 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 13: (6 comments) > Uploaded patch set 13. http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@15 PS12, Line 15: " > Use double quotes here ("\u00FF"), this is not a valid character literal bu Done http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: wo key changes: > "the special characters that need to be escaped" would be better, it's not Done http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: > I think it could be formulated better. There are two things that have chang Done http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: > Nit: no need for the comma here. Done http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc@19 PS12, Line 19: > Let's fix include groups: there should be an empty line after #include "uti Done http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc@50 PS12, Line 50: '\f', > This could be '\f' (form feed), see https://www.compart.com/en/unicode/U+00 Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 13 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 11:13:58 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the "\u00FF" literal. Inclusion of "\u00FF" was likely a mistake from the beginning and it should have been '\x7F'. The patch addresses two key changes: 1. Before the change, there was a "set" of characters that needed to be encoded, which were stored as a string. The current patch uses an unordered_set which contains all the special characters that needs to be escaped to handle the character encoding efficiently. 2. '\xFF', which is an invalid UTF-8 byte whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 177 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/13 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 13 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 12: (6 comments) Thanks Pranav, we're close. http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@15 PS12, Line 15: ' Use double quotes here ("\u00FF"), this is not a valid character literal but it is a valid string literal. Same also later on this line. http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: the special characters "the special characters that need to be escaped" would be better, it's not all special chars. http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: , Nit: no need for the comma here. http://gerrit.cloudera.org:8080/#/c/21131/12//COMMIT_MSG@18 PS12, Line 18: included I think it could be formulated better. There are two things that have changed: 1. '\xFF' was corrected to '\x7F' and 2. before this change we also had a "set" of characters that needed to be encoded, but it used to be stored as a string and now it's stored as an unordered_set (which is much better in my opinion). I think we should mention these separately. http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc@19 PS12, Line 19: #include Let's fix include groups: there should be an empty line after #include "util/coding-util.h": after it come the standard library includes. Also, leave a line after because the boost and sasl includes are includes from other libraries than std. Usually we group includes like this: 1. header belonging to the current .cc file 2. standard library includes 3. other (third party) library includes 4. impala includes Each group is ordered alphabetically and there is an empty line between groups. http://gerrit.cloudera.org:8080/#/c/21131/12/be/src/util/coding-util.cc@50 PS12, Line 50: '\x0C' This could be '\f' (form feed), see https://www.compart.com/en/unicode/U+000C. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 12 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 10:36:08 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 12: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16060/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 12 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 05:39:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 12: (16 comments) > Uploaded patch set 12. http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@9 PS11, Line 9: An > An error - we're introducing it now. Done http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@14 PS11, Line 14: matched one of th > matched one of the ... Done http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@15 PS11, Line 15: '\u00FF' > '\u00FF' is not a valid character literal in C++. "\u00FF" was included in Done http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@18 PS11, Line 18: A set > See the comment above, we should explain that it was a mistake from the beg Done http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22 PS11, Line 22: > Parquet and Iceberg are not exclusive things, we also use Parquet in our Ic Done http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22 PS11, Line 22: > It would be good to include in which file the new tests are. Done http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@51 PS11, Line 51: '\x1A > Why adding space character here? Hive doesn't escape it in partition paths. Done http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@57 PS11, Line 57: "upp > It would be clearer like this: "uppercase" and "hex" only affect the insert Done http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@62 PS11, Line 62: // the character is not alphanumeric and it is not one of the characters specifically > This doesn't match the code. Try this: Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@450 PS11, Line 450: 'R > We just need double single quotes in the RESULTS section to escape a single Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@452 PS11, Line 452: > I don't think need this. It's used for non-ascii characters. Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@470 PS11, Line 470: > This will fail if we escape space. Ack http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test File testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test: http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@327 PS11, Line 327: TYPES > You don't need to drop the table, the tables are created in a unique databa Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@337 PS11, Line 337: > I don't think this is needed. Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@344 PS11, Line 344: > You should do the same insertions into the Iceberg table as into the other Done http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@354 PS11, Line 354: > See L327. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 12 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 05:14:02 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters An error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte matched one of the two bytes that represented the '\u00FF' literal.Inclusion of '\u00FF' was likely a mistake from the beginning and it should have been '\x7F'. A set containing all the special characters has been included and, '\xFF', which is an invalid UTF-8 byte whose inclusion was erroneous from the beginning, is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Tests on both traditional Hive tables and Iceberg tables are included in unicode-column-name.test, insert.test and coding-util-test.cc. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 178 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/12 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 12 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 11: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@51 PS11, Line 51: '\x20' > For those characters that have a readable literal representation, e.g. " ", Why adding space character here? Hive doesn't escape it in partition paths. E.g. beeline -u "jdbc:hive2://localhost:11050" hive> create table my_part_tbl(i int) partitioned by (p string); hive> alter table my_part_tbl add partition(p='a b'); hive> show partitions my_part_tbl; ++ | partition | ++ | p=a b | ++ hive> describe formatted my_part_tbl partition(p='a b'); ... | Location: | hdfs://localhost:20500/test-warehouse/my_part_tbl/p=a b | NULL | ... http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@450 PS11, Line 450: '' We just need double single quotes in the RESULTS section to escape a single quote. Don't need to do this in the QUERY section. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@452 PS11, Line 452: RAW_STRING I don't think need this. It's used for non-ascii characters. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/insert.test@470 PS11, Line 470: This will fail if we escape space. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 11 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 30 Apr 2024 01:47:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 11: (17 comments) http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@9 PS11, Line 9: The An error - we're introducing it now. http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@14 PS11, Line 14: was matching with matched one of the ... http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@15 PS11, Line 15: '\u00FF' '\u00FF' is not a valid character literal in C++. "\u00FF" was included in a string, it was part of a string literal. At last I found out that it can actually be used in C++ to specify arbitrary unicode values, but of course it may be encoded on multiple bytes, see https://en.cppreference.com/w/cpp/language/escape. Also include in the commit message that including this character was probably a mistake from the beginning, it should have been '\x7F'. http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@18 PS11, Line 18: '\xFF' See the comment above, we should explain that it was a mistake from the beginning. http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22 PS11, Line 22: Some It would be good to include in which file the new tests are. http://gerrit.cloudera.org:8080/#/c/21131/11//COMMIT_MSG@22 PS11, Line 22: for both parquet and iceberg Parquet and Iceberg are not exclusive things, we also use Parquet in our Iceberg tables. Parquet is a file format and Iceberg is a table format. We could say "tests on both traditional Hive tables and Iceberg tables ...". http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@25 PS11, Line 25: #include The includes within a group should be in alphabetical order, so this should go before L23. http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@31 PS11, Line 31: #include "sasl/saslutil.h" Includes from the standard library should be the first group after the header, i.e. this should come after #include . http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@51 PS11, Line 51: '\x20' > Represents space character For those characters that have a readable literal representation, e.g. " ", "\n" etc., we should use that even if Hive doesn't. http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@57 PS11, Line 57: Used It would be clearer like this: "uppercase" and "hex" only affect the insertion of integers, not that of char values. http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@62 PS11, Line 62: // Hive-compat mode, and the character is not alphanumeric or is one This doesn't match the code. Try this: "... b) we are not in Hive-compat mode and the character is not alphanumeric and it is not one of the characters specifically excluded from escaping (see ShouldNotEscape())." http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test File testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test: http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@302 PS11, Line 302: show tables; I don't think it's needed. Also, if another table is added later before this query, this will have to be changed as well. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@309 PS11, Line 309: insert into unicode_partition_values partition(p='运营业务数据') values (0); You don't need to put these inserts into separate QUERY segments. It could also go together with the select on L318. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@327 PS11, Line 327: drop table unicode_partition_values; You don't need to drop the table, the tables are created in a unique database, see test_column_unicode.py::TestColumnUnicode::test_create_unicode_table. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@337 PS11, Line 337: show tables; I don't think this is needed. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@344 PS11, Line 344: insert into unicode_partition_values_iceberg values (0, '运'); You should do the same insertions into the Iceberg table as into the other one. http://gerrit.cloudera.org:8080/#/c/21131/11/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@354 PS11, Line 354: drop table
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 11: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16052/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 11 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 29 Apr 2024 12:07:23 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 11: (5 comments) > Patch Set 11: > > (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc@94 PS10, Line 94: TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", false); : TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", true); > Please add some tests like these for newly added marks like '\n', '\r', '^' Done http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@a40 PS10, Line 40: : : : > Can you keep this comment? We should still mention common/src/java/org/apac Done http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@45 PS10, Line 45: // characters it will encode. : // See common/src/java/org/apache/hadoop/hive/common > We are still missing several characters here, e.g. single quote, '\n' and ' Done http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util.cc@51 PS11, Line 51: '\x20' Represents space character http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test@452 PS10, Line 452: RESULTS: RAW_STRING : '0','O''Reilly' : TYPES : STRING, STRING : : QUERY : truncate insert_string_partitioned; : INSERT INTO TABLE insert_string_partitioned PARTITION(s2="Impala''#%*/:=?{}[]^") values ('0'); : SELECT * FROM insert_string_partitioned where s2= "Impala''#%*/:=?{}[]^"; : RESULTS: RAW_STRING : '0','Impala''#%*/:=?{}[]^' : TYPES > e2e tests for ASCII characters can be added here, e.g. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 11 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 29 Apr 2024 11:43:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte was matching with one of the two bytes that represented the '\u00FF' literal. A set containing all the special characters has been included and, '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Some basic tests for both parquet and iceberg are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util-test.cc M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/insert.test M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 4 files changed, 204 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/11 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 11 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util-test.cc@100 PS11, Line 100: "%08%03%04%05%06%07%09%0c%0d%0ebeginners%0f%10%11%12%13%14%15%16%17%18%19%10Amuststart" line too long (93 > 90) http://gerrit.cloudera.org:8080/#/c/21131/11/be/src/util/coding-util-test.cc@103 PS11, Line 103: "\x07\x09\x0C\r\x0E""beginners\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x10""" line too long (91 > 90) -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 11 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 29 Apr 2024 11:42:13 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc File be/src/util/coding-util-test.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util-test.cc@94 PS10, Line 94: TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", false); : TestUrl("/home/impala/directory/", "%2Fhome%2Fimpala%2Fdirectory%2F", true); Please add some tests like these for newly added marks like '\n', '\r', '^', etc. EDIT: please include all the escaped characters. http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@a40 PS10, Line 40: : : : Can you keep this comment? We should still mention common/src/java/org/apache/hadoop/hive/common/FileUtils.java since the list would change there. Also mention here that "Impala creates the partition directories using these encoded strings. So it's crucial to keep it consistent with Hive." http://gerrit.cloudera.org:8080/#/c/21131/10/be/src/util/coding-util.cc@45 PS10, Line 45: static const std::unordered_set SpecialCharacters = {'"', '#', '\\', '*', '/', : ':', '=', '?', '%', '^', '[', ']', '{', '\x7F'}; We are still missing several characters here, e.g. single quote, '\n' and '\r'. I think we should use the same list as Hive's. The list in Hive consists of control characters, i.e. ASCII characters from 0 to 31 and 127 (0x7F), and marks like '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '{', '[', ']', '^'. Code snipper of Hive: static BitSet charToEscape = new BitSet(128); static { for (char c = 0; c < ' '; c++) { charToEscape.set(c); } /** * ASCII 01-1F are HTTP control characters that need to be escaped. * \u000A and \u000D are \n and \r, respectively. */ char[] clist = new char[] {'\u0001', '\u0002', '\u0003', '\u0004', '\u0005', '\u0006', '\u0007', '\u0008', '\u0009', '\n', '\u000B', '\u000C', '\r', '\u000E', '\u000F', '\u0010', '\u0011', '\u0012', '\u0013', '\u0014', '\u0015', '\u0016', '\u0017', '\u0018', '\u0019', '\u001A', '\u001B', '\u001C', '\u001D', '\u001E', '\u001F', '"', '#', '%', '\'', '*', '/', ':', '=', '?', '\\', '\u007F', '{', '[', ']', '^'}; for (char c : clist) { charToEscape.set(c); } Note that Hive adds some characters twice. ASCII code of ' ' is 32. So in the first loop, 0-31 are already added. In Java, char is a "16-bit integer" that represents a single 16-bit Unicode character. So Hive code uses unicodes like '\u007F'. But in C++, such unicodes are represented by UTF-8 bytes. And char in C++ is just 8 bits. So in the original code "\u00FF" is actually two chars. http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test File testdata/workloads/functional-query/queries/QueryTest/insert.test: http://gerrit.cloudera.org:8080/#/c/21131/10/testdata/workloads/functional-query/queries/QueryTest/insert.test@452 PS10, Line 452: INSERT INTO TABLE insert_string_partitioned PARTITION(s2="_.~ +") : SELECT "value" FROM functional.alltypessmall LIMIT 1; : RESULTS : s2=_.~ +: 1 : : QUERY : # select with unencoded partition key : SELECT * FROM insert_string_partitioned; : RESULTS : 'value','_.~ +' : TYPES : string, string e2e tests for ASCII characters can be added here, e.g. INSERT INTO TABLE insert_string_partitioned PARTITION(s2="O'Reilly") values ('0'); Or use a string with all escaped marks. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Sun, 28 Apr 2024 02:12:16 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 10: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Sat, 27 Apr 2024 09:13:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 10: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16042/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Sat, 27 Apr 2024 04:56:11 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 10: (2 comments) > Patch Set 10: > > Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10587/ > DRY_RUN=true http://gerrit.cloudera.org:8080/#/c/21131/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/9//COMMIT_MSG@18 PS9, Line 18: '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', > I don't see '\x7F' in the final code, did you change the approach and not u Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@69 PS6, Line 69: *out = ""; > Not without an adapter. string_view(in, in_len) would work. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Sat, 27 Apr 2024 04:37:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 10: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10587/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Sat, 27 Apr 2024 04:32:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte was matching with one of the two bytes that represented the '\u00FF' literal. A set containing all the special characters has been included and, '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Some basic tests for both parquet and iceberg are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 2 files changed, 79 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/10 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 10 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 9: (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/9//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/9//COMMIT_MSG@18 PS9, Line 18: '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', I don't see '\x7F' in the final code, did you change the approach and not update the commit? Daniel's comment suggested we should include it. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@69 PS6, Line 69: if (in.empty()) { > Could use a range-based for-loop: for (char ch : in) {...}. Not without an adapter. string_view(in, in_len) would work. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 9 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 26 Apr 2024 21:10:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 9: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16035/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 9 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 26 Apr 2024 17:13:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte was matching with one of the two bytes that represented the '\u00FF' literal. A set containing all the special characters has been included and, '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Some basic tests for both parquet and iceberg are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 2 files changed, 79 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/9 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 9 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 8: Build Failed https://jenkins.impala.io/job/gerrit-code-review-checks/16032/ : Initial code review checks failed. See linked job for details on the failure. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 8 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 26 Apr 2024 16:10:16 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The error came from an issue with URL encoding, where certain Unicode characters were being incorrectly encoded due to their UTF-8 representation matching characters in the set of characters to escape. For example, the string '运', which consists of three bytes 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', because the middle byte was matching with one of the two bytes that represented the '\u00FF' literal. A set containing all the special characters has been included and, '\xFF', which is an invalid UTF-8 byte is replaced with '\x7F', which is a control character for DELETE, ensuring consistency and correctness in URL encoding. Testing: Some basic tests for both parquet and iceberg are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 2 files changed, 79 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/8 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 8 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 8: (15 comments) > Uploaded patch set 8: Commit message was updated. http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@7 PS6, Line 7: IMPALA-11499: Refactor UrlEncode function to handle special characters > We don't usually break the title into two lines, and it is not too long eit Done http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@8 PS6, Line 8: > nit: let's put the title in one line Done http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@12 PS6, Line 12: bytes > Nit: surround with quotes ('specialCharacterMap'). Ack http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@13 PS6, Line 13: 0xe8 0xbf 0x90 was wrongly getting encoded into '\E8%FFBF\90', > It's unclear to me how the unicode characters are handled incorrectly. E.g. Ack http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@18 PS6, Line 18: '\xFF', which is an invalid UTF-8 byte is replaced > '*out' was assigned to also before this change, so its old contents were di Ack http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@23 PS6, Line 23: #include > I don't think we need iostream. Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@46 PS6, Line 46: ':', '=', '?', > Could also be static. Alternatively, we could put all these static function Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@55 PS6, Line 55: / cha > The problem in the old code was that "\u00FF" translates to two bytes: [194 Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@66 PS6, Line 66: } > No need to clear it if it is assigned to at the end of the function. Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@74 PS6, Line 74: : > Could be misleading: does "not" also apply to "one of the commonly..."? If Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80 PS6, Line 80: > This is easier to understand if converted to a form without parentheses: Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80 PS6, Line 80: > It's an existing issue but I think we should cast 'ch' to unsigned char as Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85 PS6, Line 85: > nit: I think we can ignore explicitly using "std::" since std::uppercase is Done http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85 PS6, Line 85: > We could put std::uppercase and std::hex after L68, before the loop, as the Done http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test File testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test: http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test@298 PS3, Line 298: > This comment has not been addressed. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 8 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 26 Apr 2024 15:51:23 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 6: (13 comments) http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@7 PS6, Line 7: IMPALA-11499: Refactor UrlEncode function to handle special We don't usually break the title into two lines, and it is not too long either. http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@12 PS6, Line 12: specialCharacterMap Nit: surround with quotes ('specialCharacterMap'). http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@18 PS6, Line 18: Additionally, the function clears the output string '*out' was assigned to also before this change, so its old contents were discarded. This sentence could be removed. http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@48 PS3, Line 48: {'#', "%23"}, > Should be const. This comment has not been addressed. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@23 PS6, Line 23: #include I don't think we need iostream. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@46 PS6, Line 46: std::unordered_map Could also be static. Alternatively, we could put all these static functions and variables into an anonymous namespace, which is a more modern way of ensuring that these are not visible outside this translation unit. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@55 PS6, Line 55: '\xFF The problem in the old code was that "\u00FF" translates to two bytes: [194,191]. Note that the notation '\u' is Java-specific. The process is broken because the code takes the input bytes one by one. The hanzi character '?', in UTF-8, is [232, 191, 144]. Its second character, 191, matched with the second character of "\u00FF" and that caused the problem. I don't know why '\xFF' was included in the first place. If we'd like to handle UTF-8 only, then '\xFF' is not a valid byte. If we'd like to handle non-UTF-8 bytes, then we should handle all other bytes > 127. After some digging in Hive, I found this commit that expanded the list of escaped characters: https://github.com/apache/hive/commit/32b046bf93e3d041b2bb07a1c9b4e1ef5d977ddf The list before that commit is very similar to what we have here in the old code, except they have '\u007F' instead of '\u00FF'. That makes more sense as that is a control character for DELETE, see https://www.compart.com/en/unicode/U+007F. If we remove '\xFF', we can see that the escaped values in the map are the same as how we HEX encode values on L85, so there is no need for these values to be included in a map. Instead, we could have a set (std::unordered_set) with the keys in the current map (without '\xFF' but including '\x7F'), and we would encode input characters if they are in the set (taking into account hive_compat of course). That would also simplify the code. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@66 PS6, Line 66: (*out).clear(); No need to clear it if it is assigned to at the end of the function. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@69 PS6, Line 69: for (int i = 0; i < in_len; ++i) { Could use a range-based for-loop: for (char ch : in) {...}. http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@74 PS6, Line 74: is not : // alphanumeric or one of the commonly excluded characters Could be misleading: does "not" also apply to "one of the commonly..."? If not, we could write for example "is not alphanumeric or is one of the commonly excluded ...". http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80 PS6, Line 80: !(std::isalnum(ch) || ShouldNotEscape(ch)) This is easier to understand if converted to a form without parentheses: !std::isalnum(ch) && !ShouldNotEscape(ch) http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85 PS6, Line 85: std::uppercase << std::hex We could put std::uppercase and std::hex after L68, before the loop, as these settings don't get reset automatically, so no need to set them every time. On the other hand, I would include a comment stating that this only applies to when integer values are inserted and not when char values, therefore it doesn't cause a problem when not escaping characters (like on L88). http://gerrit.cloudera.org:8080/#/c/21131/3/testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test File testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test:
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 6: (4 comments) http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@8 PS6, Line 8: characters nit: let's put the title in one line http://gerrit.cloudera.org:8080/#/c/21131/6//COMMIT_MSG@13 PS6, Line 13: that accurately maps special characters to their URL-encoded forms. It's unclear to me how the unicode characters are handled incorrectly. E.g. the one mentioned in the JIRA description is "?" which is encoded into 3 bytes in UTF-8: 0xe8 0xbf 0x90. Could you mention in the commit message how this example fails before and works now? FWIW, there are online tools to convert Unicode to UTF-8, e.g. https://onlinetools.com/unicode/convert-unicode-to-utf8 http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@80 PS6, Line 80: std::isalnum(ch) It's an existing issue but I think we should cast 'ch' to unsigned char as memtioned in https://en.cppreference.com/w/cpp/string/byte/isalnum > the behavior of std::isalnum is undefined if the argument's value is neither > representable as unsigned char nor equal to EOF. To use these functions > safely with plain chars (or signed chars), the argument should first be > converted to unsigned char: > std::isalnum(static_cast(ch)); http://gerrit.cloudera.org:8080/#/c/21131/6/be/src/util/coding-util.cc@85 PS6, Line 85: std:: nit: I think we can ignore explicitly using "std::" since std::uppercase is introduced at L37. For std::hex, it's introduced in "common/names.h": https://github.com/apache/impala/blob/b39cd79ae84c415e0aebec2c2b4d7690d2a0cc7a/be/src/common/names.h#L82 Maybe the same for "std::isalnum". -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Fri, 26 Apr 2024 09:03:05 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 6: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 25 Apr 2024 15:42:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 6: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/16014/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 25 Apr 2024 03:46:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 6: (1 comment) > Uploaded patch set 6. http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc@117 PS4, Line 117: istringstream is(in.substr(i + 1, 2)); > Not actually done, there's still a newline added here. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Thu, 25 Apr 2024 03:20:53 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The previous code suffered from URL encoding errors due to incomplete character whitelisting and inadequate handling of Unicode characters. This commit addresses these issues by introducing a specialCharacterMap that accurately maps special characters to their URL-encoded forms. The UrlEncode function now utilizes this map to determine whether a character should be escaped. If present in the map, the character's encoded form is used; otherwise, it falls back to hexadecimal encoding. Additionally, the function clears the output string before encoding to ensure a clean start and efficiently accumulates encoded characters using stringstream. These improvements ensure accurate encoding of all characters, including special and Unicode characters, resolving the errors encountered during URL encoding. Testing: Some basic tests are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 2 files changed, 77 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/6 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 6 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc@83 PS4, Line 83: } else { > Done Ack http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc@117 PS4, Line 117: int value = 0; > Done Not actually done, there's still a newline added here. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Wed, 24 Apr 2024 18:30:18 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 5: Build Successful https://jenkins.impala.io/job/gerrit-code-review-checks/15981/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 22 Apr 2024 16:59:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 5: (3 comments) > Uploaded patch set 5. http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@83 PS3, Line 83: } else { > The exception is thrown here in the catalogd: I've tried adding all the cases in the map now. http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc@83 PS4, Line 83: } else { > Repeated lookup on specialCharacterMap. You could do Done http://gerrit.cloudera.org:8080/#/c/21131/4/be/src/util/coding-util.cc@117 PS4, Line 117: int value = 0; > Avoid unnecessary whitespace change. Done -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 22 Apr 2024 16:33:12 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. IMPALA-11499: Refactor UrlEncode function to handle special characters The previous code suffered from URL encoding errors due to incomplete character whitelisting and inadequate handling of Unicode characters. This commit addresses these issues by introducing a specialCharacterMap that accurately maps special characters to their URL-encoded forms. The UrlEncode function now utilizes this map to determine whether a character should be escaped. If present in the map, the character's encoded form is used; otherwise, it falls back to hexadecimal encoding. Additionally, the function clears the output string before encoding to ensure a clean start and efficiently accumulates encoded characters using stringstream. These improvements ensure accurate encoding of all characters, including special and Unicode characters, resolving the errors encountered during URL encoding. Testing: Some basic tests are provided in unicode-column-name.test. Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb --- M be/src/util/coding-util.cc M testdata/workloads/functional-query/queries/QueryTest/unicode-column-name.test 2 files changed, 78 insertions(+), 19 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/31/21131/5 -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 5 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@83 PS3, Line 83: if (!hive_compat) { > I think '$' and other special characters are not supported or is it another That was previously possible. The following works for me before this change: create table foo (i int) partitioned by (s string); insert into foo partition (s = "$") values (1), (2); -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 19 Mar 2024 18:52:27 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 3: Verified-1 Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/10380/ -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 18 Mar 2024 14:59:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 3: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/10380/ DRY_RUN=true -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 18 Mar 2024 10:24:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
pranav.lo...@cloudera.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 3: (1 comment) > Patch Set 3: > > (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/3/be/src/util/coding-util.cc@83 PS3, Line 83: if (!hive_compat) { > Are these additional encodings the same that were done before this change? I think '$' and other special characters are not supported or is it another bug: ERROR: TableLoadingException: Failed to load metadata for table: default.my_part_tbl CAUSED BY: NullPointerException: Invalid partition name: p=$@%23%^& -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Mon, 18 Mar 2024 09:24:21 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-11499: Refactor UrlEncode function to handle special characters
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/21131 ) Change subject: IMPALA-11499: Refactor UrlEncode function to handle special characters .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/21131/1/be/src/util/coding-util.cc File be/src/util/coding-util.cc: http://gerrit.cloudera.org:8080/#/c/21131/1/be/src/util/coding-util.cc@84 PS1, Line 84: // Encode forward slashes to prevent misinterpretation in paths or URLs > Yes, that's correct. The UrlEncode function in the provided code is designe As Daniel said, please add a clearer description of the problem and solution to the commit message. What you just said here is a good basis for that. -- To view, visit http://gerrit.cloudera.org:8080/21131 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88c4aba5d811dfcec809583d0c16fcbc0ca730fb Gerrit-Change-Number: 21131 Gerrit-PatchSet: 3 Gerrit-Owner: Anonymous Coward Gerrit-Reviewer: Anonymous Coward Gerrit-Reviewer: Daniel Becker Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Michael Smith Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Zihao Ye Gerrit-Comment-Date: Tue, 12 Mar 2024 18:40:19 + Gerrit-HasComments: Yes