yucai created HIVE-21042:
----------------------------

             Summary: varchar and string will behave differently for non-standard utf8 characters
                 Key: HIVE-21042
                 URL: https://issues.apache.org/jira/browse/HIVE-21042
             Project: Hive
          Issue Type: Bug
            Reporter: yucai

If the data contains non-standard UTF-8 characters (byte sequences that are not valid UTF-8), VARCHAR and STRING behave differently.

The content of /tmp/hex_data, shown in hex, is "6130373530633166313366306B35*B0A5*46386A*8DAEAB*62B4526F273464613936".

*B0A5* and *8DAEAB* are non-standard UTF-8 characters. If the data type is VARCHAR, each of their bytes is replaced by EFBFBD (the UTF-8 encoding of U+FFFD, the replacement character); if the data type is STRING, the bytes are left unchanged. The lone byte B4 near the end is affected in the same way, as the output below shows.

So:

VARCHAR shows: 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
STRING shows: 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936

See details:

{code:java}
hive> DROP TABLE TBL_S;
OK
Time taken: 0.562 seconds
hive> CREATE TABLE TBL_S
    > (
    > GUID STRING
    > )
    > row format delimited fields terminated by '\177' stored as textfile
    > LOCATION
    > '/tmp/hex_data'
    > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss')
    > ;
OK
Time taken: 3.074 seconds
hive>
    > DROP TABLE TBL_V;
OK
Time taken: 0.894 seconds
hive> CREATE TABLE TBL_V
    > (
    > GUID VARCHAR(32)
    > )
    > row format delimited fields terminated by '\177' stored as textfile
    > LOCATION
    > '/tmp/hex_data'
    > tblproperties('serialization.null.format'='', 'timestamp.formats' = 'yyyy-MM-dd HH:mm:ss')
    > ;
OK
Time taken: 0.242 seconds
hive> SELECT GUID, hex(GUID) FROM TBL_S;
OK
a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936
Time taken: 1.581 seconds, Fetched: 1 row(s)
hive> SELECT GUID, hex(GUID) FROM TBL_V;
OK
a0750c1f13f0k5��F8j���b�Ro'4da96 6130373530633166313366306B35EFBFBDEFBFBD46386AEFBFBDEFBFBDEFBFBD62EFBFBD526F273464613936
{code}

Is it expected?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
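
For reference, the VARCHAR hex string in this report is what results when the raw bytes are decoded as UTF-8 with malformed input replaced by U+FFFD and then re-encoded, while the STRING hex string is the raw bytes passed through untouched. The sketch below reproduces both values in plain Python. It is not Hive code, the function names are hypothetical, and whether Hive's VARCHAR path actually performs such a decode and re-encode round trip is an assumption that this report does not confirm.

```python
# Illustrative sketch only: plain Python, not Hive code.
# RAW_HEX is the content of /tmp/hex_data as given in the report.
RAW_HEX = "6130373530633166313366306B35B0A546386A8DAEAB62B4526F273464613936"


def passthrough_hex(raw_hex: str) -> str:
    """Return the hex of the bytes unchanged (the behaviour reported for STRING)."""
    return bytes.fromhex(raw_hex).hex().upper()


def roundtrip_hex(raw_hex: str) -> str:
    """Decode as UTF-8, replacing malformed bytes with U+FFFD, then re-encode.

    Assumption: this mimics the behaviour reported for VARCHAR; it is not
    taken from the Hive source.
    """
    raw = bytes.fromhex(raw_hex)
    text = raw.decode("utf-8", errors="replace")  # invalid bytes -> U+FFFD
    return text.encode("utf-8").hex().upper()     # U+FFFD -> EF BF BD


if __name__ == "__main__":
    print("STRING-like :", passthrough_hex(RAW_HEX))
    print("VARCHAR-like:", roundtrip_hex(RAW_HEX))
```

Each of the six invalid bytes (B0, A5, 8D, AE, AB, B4) is a stray UTF-8 continuation byte, so each one is replaced separately, which accounts for the six EFBFBD groups in the VARCHAR output.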