[ https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yangyang Gao updated SPARK-48973:
---------------------------------
    Description: 
The Spark mask function behaves unexpectedly when it is applied to a string that contains an invalid UTF-8 character or a wide character.

Example: using `*` to mask a string that contains the wide character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
returns `**` instead of `*`. It looks like Spark's mask treats {{🙂}} as 2 characters.

Example: using the wide character {{🙂}} as the mask replacement produces garbled output
{code:sql}
select mask("ABC", "🙂");
{code}
The result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:sql}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`; it looks like Spark treats the input as four characters: `\`, `x`, `E`, `D`.

It looks like Spark's mask can only handle BMP characters (those that fit in 16 bits) and cannot guarantee the result for invalid UTF-8 characters and wide characters.

My question is: *is this a limitation / issue of the Spark mask function, or does Spark's mask by design only handle BMP characters?*
If it is a limitation of the mask function, could Spark state this in the mask function documentation or comments?

  was:
The Spark mask function behaves unexpectedly when it is applied to a string that contains an invalid UTF-8 character or a wide character.

Example: using `*` to mask a string that contains the wide character {{🙂}}
{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}
returns {{**}} instead of {{*}}. It looks like Spark's mask treats {{🙂}} as 2 characters.

Example: using the wide character {{🙂}} as the mask replacement produces garbled output
{code:sql}
select mask("ABC", "🙂");
{code}
The result is `???`.

Example: masking a string that contains an invalid UTF-8 character
{code:sql}
select mask("\xED");
{code}
The result is `xXX` instead of `\xED`; it looks like Spark treats the input as four characters: `\`, `x`, `E`, `D`.
It looks like Spark's mask can only handle BMP characters (those that fit in 16 bits) and cannot guarantee the result for invalid UTF-8 characters and wide characters.

My question is: *is this a limitation / issue of the Spark mask function, or does Spark's mask by design only handle BMP characters?*
If it is a limitation of the mask function, could Spark state this in the mask function documentation or comments?


> Unexpected behavior using spark mask function handle string contains invalid UTF-8 or wide character
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48973
>                 URL: https://issues.apache.org/jira/browse/SPARK-48973
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
>         Environment: Ubuntu 22.04
>            Reporter: Yangyang Gao
>            Priority: Major
>
> The Spark mask function behaves unexpectedly when it is applied to a string
> that contains an invalid UTF-8 character or a wide character.
> Example: using `*` to mask a string that contains the wide character {{🙂}}
> {code:sql}
> select mask("🙂", "Y", "y", "n", "*");
> {code}
> returns `**` instead of `*`. It looks like Spark's mask treats {{🙂}} as 2
> characters.
> Example: using the wide character {{🙂}} as the mask replacement produces
> garbled output
> {code:sql}
> select mask("ABC", "🙂");
> {code}
> The result is `???`.
> Example: masking a string that contains an invalid UTF-8 character
> {code:sql}
> select mask("\xED");
> {code}
> The result is `xXX` instead of `\xED`; it looks like Spark treats the input
> as four characters: `\`, `x`, `E`, `D`.
> It looks like Spark's mask can only handle BMP characters (those that fit in
> 16 bits) and cannot guarantee the result for invalid UTF-8 characters and
> wide characters.
> My question is: *is this a limitation / issue of the Spark mask function,
> or does Spark's mask by design only handle BMP characters?*
> If it is a limitation of the mask function, could Spark state this in the
> mask function documentation or comments?
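
The `**` and `???` results reported above are consistent with text being processed as UTF-16 code units rather than as Unicode code points. The Java sketch below only illustrates that general JVM behavior; it is not Spark's implementation, and the names `MaskSketch`, `maskPerChar`, `maskPerCodePoint`, and `roundTripFirstChar` are hypothetical, introduced purely for illustration. It also simplifies masking to "replace every character" and ignores the upper/lower/digit rules.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: NOT Spark's mask implementation.
final class MaskSketch {

    // Replaces every UTF-16 code unit (Java char) with the replacement char.
    // A character outside the BMP occupies two code units, so it is masked twice.
    static String maskPerChar(String input, char replacement) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            sb.append(replacement);
        }
        return sb.toString();
    }

    // Replaces every Unicode code point with the replacement code point.
    static String maskPerCodePoint(String input, int replacement) {
        StringBuilder sb = new StringBuilder();
        input.codePoints().forEach(cp -> sb.appendCodePoint(replacement));
        return sb.toString();
    }

    // Keeps only the first char of the input (possibly a lone surrogate),
    // then encodes it to UTF-8 and decodes it again.
    static String roundTripFirstChar(String input) {
        String first = String.valueOf(input.charAt(0));
        byte[] utf8 = first.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE42"; // U+1F642, stored as a surrogate pair

        System.out.println(smiley.length());                           // 2 code units
        System.out.println(smiley.codePointCount(0, smiley.length())); // 1 code point
        System.out.println(maskPerChar(smiley, '*'));                  // **
        System.out.println(maskPerCodePoint(smiley, '*'));             // *
        // A lone high surrogate cannot be encoded as UTF-8 and becomes "?".
        System.out.println(roundTripFirstChar(smiley));                // ?
    }
}
```

If the mask replacement argument is truncated to its first `char`, as assumed in `roundTripFirstChar`, a non-BMP replacement such as {{🙂}} would degrade to one `?` per masked character, which would match the `???` seen for `mask("ABC", "🙂")`. Whether Spark actually does this is not confirmed by this issue.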