[ 
https://issues.apache.org/jira/browse/HIVE-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901753#action_12901753
 ] 

Ted Xu commented on HIVE-1505:
------------------------------

Thanks Edward.

I dug into the problem and found that the patch does not work when the query 
has subqueries; it is very hard to retain encoding information in those queries.

Table properties can go missing in such queries. The problem is the same as a 
missing field delimiter setting: whenever Hive can't get the table properties 
in a subquery (e.g., a join operation), the default value is used. For the 
field delimiter the default is ^A, which is why the deserializer fails most of 
the time when the data contains a ^A character, even if ^A is not set as the 
field delimiter.
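
To make the delimiter case concrete, here is a minimal sketch in Python. It is 
not Hive's deserializer code: the split_row helper and the sample row are 
hypothetical, and the only thing taken from the behavior described above is 
the fallback to ^A (\x01) when no delimiter is known.

```python
# Hypothetical illustration only; this is not Hive's deserializer.
DEFAULT_FIELD_DELIM = "\x01"  # ^A, the default field delimiter


def split_row(line, field_delim=None):
    """Split a row into fields, falling back to ^A when no delimiter is known."""
    if field_delim is None:
        field_delim = DEFAULT_FIELD_DELIM
    return line.split(field_delim)


# A three-column row from a table declared with ',' as its field delimiter.
# The data in the second column happens to contain a ^A character.
row = "id42,left\x01right,end"

# Table properties available: the row splits on ',' as intended.
with_props = split_row(row, field_delim=",")

# Table properties lost (the subquery case): the row splits on ^A instead.
without_props = split_row(row)

print(with_props)
print(without_props)
```

With the declared delimiter the row yields three fields; with the fallback it 
yields two, cut in the middle of the second column. I would expect lost 
encoding information to behave the same way, with the default encoding applied 
silently.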

> Support non-UTF8 data
> ---------------------
>
>                 Key: HIVE-1505
>                 URL: https://issues.apache.org/jira/browse/HIVE-1505
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Serializers/Deserializers
>    Affects Versions: 0.5.0
>            Reporter: bc Wong
>            Assignee: Ted Xu
>         Attachments: trunk-encoding.patch
>
>
> I'd like to work with non-UTF8 data easily.
> Suppose I have data in latin1. Currently, doing a "select *" will return the 
> upper ascii characters in '\xef\xbf\xbd', which is the replacement character 
> '\ufffd' encoded in UTF-8. Would be nice for Hive to understand different 
> encodings, or to have a concept of byte string.
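
The symptom in the description can be reproduced outside Hive. The sketch 
below is plain Python, not Hive code, and assumes only that the latin1 bytes 
are decoded as UTF-8 with invalid sequences replaced; the sample character is 
arbitrary.

```python
# Plain Python illustration of the symptom; not Hive code.
raw = "é".encode("latin-1")  # a single upper-range byte, b'\xe9'

# Decoding latin1 bytes as UTF-8: the lone byte is not a valid sequence,
# so it is replaced with U+FFFD (the replacement character).
as_utf8 = raw.decode("utf-8", errors="replace")

# Writing that string back out as UTF-8 gives the bytes from the report.
round_trip = as_utf8.encode("utf-8")

# Decoding with the right charset recovers the original character.
as_latin1 = raw.decode("latin-1")

print(repr(as_utf8), round_trip, as_latin1)
```

The original byte is gone after the first decode, so the data cannot be 
repaired downstream; the encoding has to be known at the point where the 
bytes are first decoded.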

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
