[ https://issues.apache.org/jira/browse/HIVE-26639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
László Bodor updated HIVE-26639:
--------------------------------
    Description:
In HS2 (and other components) we rely on UTF-8 encoding, so when storing strings as bytes, we store the UTF-8-encoded bytes. Some Java APIs rely on the default system encoding in different ways, which can lead to incorrect encoding if the system default is anything other than UTF-8. This patch intends to fix 2 different paths:

1. ConstantVectorExpression

In my case, this:
{code}
LOG.info("default charset name: " + java.nio.charset.Charset.defaultCharset().name());
LOG.info("getBytes() = " + ((String) constantValue).getBytes());
LOG.info("getBytes(StandardCharsets.UTF_8) = " + ((String) constantValue).getBytes(StandardCharsets.UTF_8));
{code}
led to:
{code}
default charset name: US-ASCII
getBytes() = [B@73dcffb0
getBytes(StandardCharsets.UTF_8) = [B@2ead0b9c
{code}
On the customer side, queries returned wrong results when the filter contained a special character (one that is part of the UTF-8 character table):
{code}
SELECT b FROM default.rlv_test1 where b='北京';
....
??
{code}

2. Explain

Similarly, explain printed to a PrintStream with a different encoding, leading to a plan like:
{code}
Map Operator Tree:
    TableScan
      alias: rlv_test1
      filterExpr: (b = '??') (type: boolean)
      Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
      Filter Operator
        predicate: (b = '??') (type: boolean)
        Statistics: Num rows: 2 Data size: 352 Basic stats: COMPLETE Column stats: COMPLETE
        Select Operator
          expressions: a (type: int), '??' (type: string), c (type: string)
{code}

> ConstantVectorExpression shouldn't rely on default charset
> ----------------------------------------------------------
>
>                 Key: HIVE-26639
>                 URL: https://issues.apache.org/jira/browse/HIVE-26639
>             Project: Hive
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
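
Both paths described in this issue can be reproduced outside Hive. What follows is a minimal sketch, not the actual patch: the class name CharsetDemo and the helpers encode and printWith are hypothetical, and US-ASCII is passed explicitly to simulate a JVM whose default charset is US-ASCII, as in the log output quoted in the description.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Hive.
public class CharsetDemo {

    // Encodes with an explicit charset. Passing US_ASCII mimics what the
    // no-argument String.getBytes() does when the default charset is US-ASCII.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    // Prints the string through a PrintStream created with an explicit
    // encoding and returns the raw bytes that reached the underlying stream.
    static byte[] printWith(String s, Charset cs) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, cs.name());
            out.print(s);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a charset obtained from a Charset instance.
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // The two-character literal from the issue's query, written with
        // Unicode escapes so the source file itself stays pure ASCII.
        String value = "\u5317\u4EAC";

        byte[] ascii = encode(value, StandardCharsets.US_ASCII);
        byte[] utf8 = encode(value, StandardCharsets.UTF_8);

        // Unmappable characters are replaced with '?' (byte 63).
        System.out.println("US-ASCII bytes: " + Arrays.toString(ascii)); // [63, 63]
        System.out.println("UTF-8 byte count: " + utf8.length);          // 6

        // Only the UTF-8 bytes decode back to the original string.
        System.out.println(value.equals(new String(utf8, StandardCharsets.UTF_8)));  // true
        System.out.println(value.equals(new String(ascii, StandardCharsets.UTF_8))); // false

        // The same loss occurs when printing through an ASCII PrintStream.
        System.out.println(new String(printWith(value, StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII)); // ??
    }
}
```

Note that concatenating a byte[] into a log message prints only the array's identity (such as [B@73dcffb0), so the sketch uses Arrays.toString to compare the actual contents.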