[jira] [Commented] (HIVE-3245) UTF encoded data not displayed correctly by Hive driver

Szehon Ho (JIRA) Fri, 06 Dec 2013 16:55:08 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841965#comment-13841965
 ]


Szehon Ho commented on HIVE-3245:
---------------------------------

I created the table as described in the JIRA and ran "select *" from both 
Beeline and my own Java program embedding the JDBC driver.  In both cases, the 
Japanese characters displayed correctly:

0: jdbc:hive2://localhost:10000> select * from japan_j;
+-------+------------------------------------------------+------+
| rnum  |                       c1                       | ord  |
+-------+------------------------------------------------+------+
| 11    | (1)ｲﾝﾃﾞｯｸｽ                                     | 36   |
| 12    | <5>Switches                                    | 37   |
| 10    | 400ｒａｎｋｕ                                       | 39   |
| 9     | 666Sink                                        | 40   |
| 14    | P-Cabels                                       | 35   |
| 13    | R-Bench                                        | 38   |
| 27    | エコー                                            | 34   |
| 26    | エチャント                                          | 24   |
| 25    | ガード                                            | 4    |
| 28    | コート                                            | 3    |
| 29    | ゴム                                             | 1    |
| 41    | ざぶと                                            | 2    |
| 40    | さんしょう                                          | 6    |
| 31    | ズボン                                            | 5    |
| 30    | スワップ                                           | 41   |
| 37    | せっけい                                           | 42   |
| 36    | せんたくざい                                         | 46   |
| 32    | ダイエル                                           | 45   |
| 39    | はっぽ                                            | 43   |
| 38    | はつ剤                                            | 44   |
| 34    | ファイル                                           | 48   |
| 33    | フィルター                                          | 50   |
| 35    | フッコク                                           | 49   |
| 8     | 「２」計画                                          | 47   |
| 46    | 暗視                                             | 9    |
| 45    | 音楽                                             | 8    |
| 47    | 音声認識                                           | 7    |
| 44    | 記載                                             | 10   |
| 43    | 記録機                                            | 11   |
| 42    | 高機能                                            | 15   |
| 50    | 国家利益                                           | 14   |
| 48    | 国立公園                                           | 18   |
| 49    | 国立大学                                           | 22   |
| 7     | ⑤号線路                                           | 21   |
| 5     | （Ⅰ）番号列                                         | 23   |
| 1     | ３５６CAL                                         | 17   |
| 2     | ９８０Series                                      | 16   |
| 6     | ＜ⅸ＞Pattern                                     | 20   |
| 3     | ＰＶＤＦ                                           | 19   |
| 4     | ＲＯＭＡＮ-８                                        | 13   |
| 15    | ｱﾝｶｰ                                           | 12   |
| 16    | ｴﾝｼﾞﾝ                                          | 30   |
| 19    | ｶｯﾄﾏｼﾝ                                         | 29   |
| 20    | ｶｰﾄﾞ                                           | 28   |
| 18    | ｺｰﾗ                                            | 26   |
| 17    | ｺﾞｰﾙﾄﾞ                                         | 25   |
| 24    | ｻｲﾌ                                            | 27   |
| 21    | ﾂｰｳｨﾝｸﾞ                                        | 32   |
| 23    | ﾌｫﾙﾀﾞｰ                                         | 33   |
| 22    | ﾏﾝﾎﾞ                                           | 31   |
+-------+------------------------------------------------+------+


I tested with the new JDBC driver (org.apache.hive.jdbc.HiveDriver) against 
HiveServer2.

The platform running Beeline should be set to a UTF-8 locale (check with 
"echo $LANG"), and any other Java application using the JDBC driver should be 
started with the UTF-8 JVM argument ("java -Dfile.encoding=UTF-8").  That 
should already be a requirement for clients wishing to display UTF-8 characters.
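
As a rough illustration of the client-side requirement above, here is a minimal 
sketch of how a Java client might check its default charset and force UTF-8 on 
its console output.  The class and method names (Utf8ConsoleCheck, 
defaultIsUtf8, utf8Out) are hypothetical and are not part of Hive or Beeline; 
the sample string is taken from the table output above.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hive: checks the client-side encoding setup.
public class Utf8ConsoleCheck {

    // True when the JVM default charset (influenced by -Dfile.encoding
    // or the platform locale) is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    // Wraps System.out so that characters are always written as UTF-8 bytes,
    // whatever the JVM default charset is.
    static PrintStream utf8Out() throws UnsupportedEncodingException {
        return new PrintStream(System.out, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        if (!defaultIsUtf8()) {
            System.err.println("Default charset is " + Charset.defaultCharset()
                    + "; consider starting with -Dfile.encoding=UTF-8");
        }
        // The terminal itself must still use a UTF-8 locale to render this.
        utf8Out().println("エコー");
    }
}
```

Forcing the output stream to UTF-8 only fixes the bytes the JVM writes; the 
terminal still needs a UTF-8 locale to render them.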

The code that Mark Grover mentioned no longer applies: the new JDBC driver 
gets results from HiveServer2 directly via a Thrift string field and does not 
do another round of serialization/deserialization on the client side, which is 
where the error was said to occur.  So in my opinion, the issue can be closed 
for the Hive driver.
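
To show why an extra client-side decode step can corrupt such data, here is a 
small sketch.  It is not the old driver's actual code; it only assumes, for 
illustration, that UTF-8 bytes get decoded a second time with a non-UTF-8 
charset.  The class and method names (RoundTripDemo, roundTrip) are 
hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Hive code: an extra encode/decode round trip.
public class RoundTripDemo {

    // Encodes the string as UTF-8, then decodes the bytes with the given
    // charset, mimicking a client that re-deserializes the result.
    static String roundTrip(String s, Charset decodeWith) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "ゴム";
        // Decoding with UTF-8 returns the original string unchanged.
        System.out.println(original.equals(
                roundTrip(original, StandardCharsets.UTF_8)));
        // Decoding with ISO-8859-1 turns each UTF-8 byte into one character,
        // so the result no longer matches the original.
        System.out.println(original.equals(
                roundTrip(original, StandardCharsets.ISO_8859_1)));
    }
}
```

A driver that passes the string through without re-decoding it avoids this 
class of problem, which matches the behavior I saw in the test above.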

> UTF encoded data not displayed correctly by Hive driver
> -------------------------------------------------------
>
>                 Key: HIVE-3245
>                 URL: https://issues.apache.org/jira/browse/HIVE-3245
>             Project: Hive
>          Issue Type: Bug
>          Components: JDBC
>    Affects Versions: 0.8.0
>            Reporter: N Campbell
>            Assignee: Szehon Ho
>         Attachments: ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg, CERT.TLJA.txt
>
>
> Various foreign language data (e.g. Japanese, Thai etc.) is loaded into string 
> columns via tab-delimited text files. A simple projection of the columns in 
> the table is not displaying the correct data. Exporting the data from Hive 
> and looking at the files implies the data is loaded properly. It appears to 
> be an encoding issue at the driver, but I am unaware of any required URL 
> connection properties regarding encoding that Hive JDBC requires.
> create table if not exists CERT.TLJA_JP_E ( RNUM int , C1 string, ORD int)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
> create table if not exists CERT.TLJA_JP ( RNUM int , C1 string, ORD int)
> stored as sequencefile;
> load data local inpath '/home/hadoopadmin/jdbc-cert/CERT/CERT.TLJA_JP.txt'
> overwrite into table CERT.TLJA_JP_E;
> insert overwrite table CERT.TLJA_JP  select * from CERT.TLJA_JP_E;



--
This message was sent by Atlassian JIRA
(v6.1#6144)
