[ 
https://issues.apache.org/jira/browse/SPARK-48973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangyang Gao updated SPARK-48973:
---------------------------------
    Description: 
In Spark, applying the mask function to a string that contains an invalid UTF-8 
character or a wide character produces unexpected behavior.


Example: using `*` to mask a string that contains the wide character {{🙂}}


{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}


The result is `**` instead of `*`. It looks like Spark's mask treats {{🙂}} as 2 
characters.
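
A likely explanation, stated here as an assumption rather than a confirmed reading of the Spark source, is that the masking loop visits UTF-16 code units (JVM `char` values) instead of Unicode code points. The Python sketch below models that difference; `utf16_units` and the two `mask_other_*` helpers are hypothetical names for illustration, not Spark APIs.

```python
def utf16_units(s: str) -> list:
    """Return the UTF-16 code units of s, as a JVM String would store them."""
    raw = s.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def mask_other_per_code_unit(s: str, other: str) -> str:
    """Replace every UTF-16 code unit with `other` (models the reported result)."""
    return other * len(utf16_units(s))


def mask_other_per_code_point(s: str, other: str) -> str:
    """Replace every Unicode code point with `other` (the result the reporter expected)."""
    return other * len(s)


smiley = "\U0001F642"  # U+1F642, a code point outside the BMP

print([hex(u) for u in utf16_units(smiley)])   # ['0xd83d', '0xde42'], a surrogate pair
print(mask_other_per_code_unit(smiley, "*"))   # **
print(mask_other_per_code_point(smiley, "*"))  # *
```

The two helpers only cover the "other character" case, which is the category {{🙂}} falls into in the query above.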

Example: using the wide character {{🙂}} as the replacement character produces 
garbled output


{code:sql}
select mask("ABC", "🙂");
{code}

The result is `???`.
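
One way this output could arise, assuming (not confirmed from the Spark source) that only the first UTF-16 code unit of the replacement argument is used: for {{🙂}} that unit is a lone high surrogate, which has no valid encoding on its own and is commonly rendered as `?`. The helper names below are hypothetical.

```python
def first_utf16_unit(s: str) -> int:
    """Return the first UTF-16 code unit of s."""
    raw = s.encode("utf-16-le")
    return int.from_bytes(raw[:2], "little")


def is_surrogate(unit: int) -> bool:
    """True if the 16-bit unit is half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDFFF


def mask_upper_with_first_unit(text: str, replacement: str) -> str:
    """Mask uppercase letters using only the first UTF-16 unit of `replacement`."""
    unit = first_utf16_unit(replacement)
    # A lone surrogate cannot be encoded; model the usual '?' substitution.
    shown = "?" if is_surrogate(unit) else chr(unit)
    return "".join(shown if ch.isupper() else ch for ch in text)


print(mask_upper_with_first_unit("ABC", "\U0001F642"))  # ???
print(mask_upper_with_first_unit("ABC", "Y"))           # YYY
```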

Example: masking a string that contains an invalid UTF-8 character

{code:sql}
select mask("\xED");
{code}

The result is `xXX` instead of `\xED`; it looks like Spark treats the input as four 
characters: `\`, `x`, `E`, `D`.
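
To separate the two possible readings of this input, the sketch below contrasts the single byte `0xED` with the four-character text `\xED`. It uses only Python's standard codecs; `is_valid_utf8` is a hypothetical helper name.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One byte: 0xED is the lead byte of a 3-byte sequence with no continuation bytes.
single_byte = b"\xed"
# Four characters: backslash, x, E, D.
literal_text = r"\xED"

print(is_valid_utf8(single_byte))                    # False
print(len(literal_text))                             # 4
print(is_valid_utf8(literal_text.encode("utf-8")))   # True
```

Whether the SQL parser turns `\xED` into a byte or keeps it as text depends on its escape handling, which this sketch does not model.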

It looks like Spark's mask can only handle BMP characters (those that fit in 16 bits) and 
cannot guarantee the result for invalid UTF-8 characters or wide characters.
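
For comparison, a mask that iterates over Unicode code points rather than 16-bit units would treat a non-BMP character as a single character. The sketch below is a hypothetical Python model, not Spark's implementation; the function names and the defaults (`X`, `x`, `n`, and leaving other characters unchanged) are assumptions chosen for illustration.

```python
from typing import Optional


def is_bmp(ch: str) -> bool:
    """True if the single code point `ch` lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


def mask_by_code_point(text: str, upper: str = "X", lower: str = "x",
                       digit: str = "n", other: Optional[str] = None) -> str:
    """Mask `text` one Unicode code point at a time."""
    out = []
    for ch in text:  # Python iterates a str by code point, not by UTF-16 unit
        if ch.isupper():
            out.append(upper)
        elif ch.islower():
            out.append(lower)
        elif ch.isdigit():
            out.append(digit)
        elif other is not None:
            out.append(other)
        else:
            out.append(ch)  # no replacement given: keep the character as-is
    return "".join(out)


smiley = "\U0001F642"
print(is_bmp("A"), is_bmp(smiley))                        # True False
print(mask_by_code_point(smiley, "Y", "y", "n", "*"))     # *
print(mask_by_code_point("ABC", smiley) == smiley * 3)    # True
```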


My question is: *is this a limitation / issue of the Spark mask function, or does 
Spark mask by design only handle BMP characters?*

If it is a limitation of the mask function, could Spark note this in the mask 
function documentation or comments?

 

  was:
In Spark, applying the mask function to a string that contains an invalid UTF-8 
character or a wide character produces unexpected behavior.


Example: using `*` to mask a string that contains the wide character {{🙂}}


{code:sql}
select mask("🙂", "Y", "y", "n", "*");
{code}


The result is {{**}} instead of {{*}}. It looks like Spark's mask treats {{🙂}} as 
2 characters.

Example: using the wide character {{🙂}} as the replacement character produces 
garbled output


{code:sql}
select mask("ABC", "🙂");
{code}

The result is `???`.

Example: masking a string that contains an invalid UTF-8 character

{code:sql}
select mask("\xED");
{code}

The result is `xXX` instead of `\xED`; it looks like Spark treats the input as four 
characters: `\`, `x`, `E`, `D`.

It looks like Spark's mask can only handle BMP characters (those that fit in 16 bits) and 
cannot guarantee the result for invalid UTF-8 characters or wide characters.


My question is: *is this a limitation / issue of the Spark mask function, or does 
Spark mask by design only handle BMP characters?*

If it is a limitation of the mask function, could Spark note this in the mask 
function documentation or comments?

 


> Unexpected behavior when the Spark mask function handles a string containing 
> invalid UTF-8 or wide characters
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-48973
>                 URL: https://issues.apache.org/jira/browse/SPARK-48973
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 3.3.1, 3.2.4, 3.3.2, 3.4.1, 3.5.0, 4.0.0, 3.5.1, 3.3.4
>         Environment: Ubuntu 22.04
>            Reporter: Yangyang Gao
>            Priority: Major
>



