[ https://issues.apache.org/jira/browse/HIVE-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor updated HIVE-26639:
--------------------------------
    Description: 
In HS2 (and other components) we rely on UTF-8 encoding, so when storing 
strings as bytes, we store the UTF-8-encoded bytes. Some Java APIs rely on 
the default system encoding in different ways, which can lead to incorrect 
encoding (if the system default is anything other than UTF-8). This patch 
intends to fix 2 different paths:

1. ConstantVectorExpression
in my case, this:
{code}
LOG.info("default charset name: " + 
java.nio.charset.Charset.defaultCharset().name());
LOG.info("getBytes() = " + ((String) constantValue).getBytes());
LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) 
constantValue).getBytes(StandardCharsets.UTF_8));
{code}
led to:
{code}
default charset name: US-ASCII
getBytes() = [B@73dcffb0
getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
{code}
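
Since the log lines above only print array identities, the difference between the two calls is easier to see when the byte values themselves are printed. The following is a minimal sketch, not the patch itself: the class and method names (CharsetDemo, encode) are hypothetical, and US-ASCII is passed explicitly to simulate the default charset reported in the log.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {
    // Hypothetical helper: encodes with an explicit charset, so the result
    // does not depend on the JVM default charset.
    static byte[] encode(String value, Charset charset) {
        return value.getBytes(charset);
    }

    public static void main(String[] args) {
        // "\u5317\u4eac" is the constant from the query; Unicode escapes keep
        // this source file independent of the compiler's source encoding.
        String constantValue = "\u5317\u4eac";

        // Simulates getBytes() on a JVM whose default charset is US-ASCII:
        // each unmappable character is replaced by '?' (byte 63).
        System.out.println(Arrays.toString(encode(constantValue, StandardCharsets.US_ASCII)));
        // prints [63, 63]

        // Explicit UTF-8: three bytes per character for these two characters.
        System.out.println(Arrays.toString(encode(constantValue, StandardCharsets.UTF_8)));
        // prints [-27, -116, -105, -28, -70, -84]
    }
}
```

Passing the charset explicitly, as in the second call, is what makes the encoding independent of system settings.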

On the customer side, queries returned wrong results when the filter contained 
a non-ASCII character (one that UTF-8 can encode):
{code}
SELECT b FROM default.rlv_test1 where b='北京';
....
??
{code}
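
One way to picture why such a filter misbehaves is shown below. This is a sketch under assumptions, not Hive's actual comparison code: it assumes the column value is held as UTF-8 bytes and the filter constant was encoded with a US-ASCII default charset. The names FilterEncodingDemo, constantMatchesStored and decodeAsUtf8 are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FilterEncodingDemo {
    // Assumption: the stored column value is kept as UTF-8 bytes.
    private static final byte[] STORED = "\u5317\u4eac".getBytes(StandardCharsets.UTF_8);

    // Hypothetical check: does the constant, encoded with the given charset,
    // match the stored UTF-8 bytes byte for byte?
    static boolean constantMatchesStored(String constant, Charset constantCharset) {
        return Arrays.equals(STORED, constant.getBytes(constantCharset));
    }

    // Decodes the constant's bytes as UTF-8, the way a UTF-8 reader would see them.
    static String decodeAsUtf8(String constant, Charset constantCharset) {
        return new String(constant.getBytes(constantCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String constant = "\u5317\u4eac";
        System.out.println(constantMatchesStored(constant, StandardCharsets.US_ASCII)); // prints false
        System.out.println(constantMatchesStored(constant, StandardCharsets.UTF_8));    // prints true
        System.out.println(decodeAsUtf8(constant, StandardCharsets.US_ASCII));          // prints ??
    }
}
```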


2. Explain
Similarly, EXPLAIN printed the plan to a PrintStream that used a different 
encoding, leading to a plan like:
{code}
                    Map Operator Tree:
                        TableScan
                          alias: rlv_test1
                          filterExpr: (b = '??') (type: boolean)
                          Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
                          Filter Operator
                            predicate: (b = '??') (type: boolean)
                            Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
                            Select Operator
                              expressions: a (type: int), '??' (type: string), c (type: string)
{code}
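
The PrintStream effect can be reproduced in isolation. The sketch below is only an illustration, not the code path used by explain: ExplainPrintDemo and render are hypothetical names, and the stream encoding is passed explicitly to simulate a non-UTF-8 default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplainPrintDemo {
    // Hypothetical helper: prints one plan line through a PrintStream that uses
    // the given encoding, then reads the produced bytes back as UTF-8.
    static String render(String planLine, String streamEncoding) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, streamEncoding)) {
            out.print(planLine);
            out.flush();
            return buffer.toString("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + streamEncoding, e);
        }
    }

    public static void main(String[] args) {
        String planLine = "filterExpr: (b = '\u5317\u4eac') (type: boolean)";
        // A US-ASCII stream replaces each unmappable character with '?'.
        System.out.println(render(planLine, "US-ASCII").equals("filterExpr: (b = '??') (type: boolean)")); // prints true
        // A UTF-8 stream round-trips the line unchanged.
        System.out.println(render(planLine, "UTF-8").equals(planLine)); // prints true
    }
}
```

Constructing the PrintStream with an explicit encoding name, rather than relying on the single-argument constructor, keeps the printed plan independent of the system default.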


> ConstantVectorExpression shouldn't rely on default charset
> ----------------------------------------------------------
>
>                 Key: HIVE-26639
>                 URL: https://issues.apache.org/jira/browse/HIVE-26639
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
