[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-04-09 Thread Kevin Wilfong (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wilfong updated HIVE-4199:


Status: Open  (was: Patch Available)

 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch


 StringTreeWriter currently converts fields stored as Text objects into 
 Strings. This can lose information (see 
 http://en.wikipedia.org/wiki/Replacement_character#Replacement_character), 
 and is also unnecessary since the dictionary stores Text objects.
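
The information loss described above can be reproduced with the plain JDK. The following is a minimal sketch, not code from the patch: it uses a raw byte[] in place of Hadoop's Text, and the class and method names are made up for illustration. It assumes the conversion decodes the bytes as UTF-8, which replaces malformed input with U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; byte[] stands in for the bytes held by a Text object.
final class LossyRoundTrip {
    // Decode the bytes as UTF-8 into a String, then encode the String back to UTF-8.
    // new String(bytes, charset) substitutes U+FFFD for malformed input.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 is "e acute" in ISO-8859-1 but is not a complete UTF-8 sequence.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};
        byte[] back = roundTrip(raw);
        System.out.println(Arrays.equals(raw, back)); // false: the original byte is lost
    }
}
```

Once the byte has been replaced, the original value cannot be recovered from the String, which is why keeping the bytes untouched is the safer path.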

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-03-28 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HIVE-4199:
--

Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch

sxyuan updated the revision HIVE-4199 [jira] ORC writer doesn't handle 
non-UTF8 encoded Text properly.

  Updated test case to clarify the expected behaviour.

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=30009&id=30675#toc

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA


 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch





[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-03-18 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HIVE-4199:
--

Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch

sxyuan requested code review of HIVE-4199 [jira] ORC writer doesn't handle 
non-UTF8 encoded Text properly.

Reviewers: kevinwilfong

StringTreeWriter currently converts fields stored as Text objects into Strings. 
This can lose information (see 
http://en.wikipedia.org/wiki/Replacement_character#Replacement_character), and 
is also unnecessary since the dictionary stores Text objects.

Instead, we can check whether Text or String is preferred and simply use the 
preferred class, converting to String only for the index stats.
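
As a rough illustration of that idea (not the actual StringRedBlackTree or WriterImpl code), the sketch below keeps dictionary keys as raw bytes and decodes to String only when updating statistics. All names are hypothetical, and a byte[] stands in for a Text object.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeMap;

// Hypothetical stand-in for a string dictionary; not the ORC implementation.
final class ByteDictionarySketch {
    // Keys are compared as unsigned bytes, so any byte sequence is a valid key.
    private final TreeMap<byte[], Integer> ids =
        new TreeMap<byte[], Integer>((a, b) -> Arrays.compareUnsigned(a, b));
    private String minForStats; // statistics are tracked as String
    private String maxForStats;

    // Stores the raw bytes without decoding them and returns the dictionary id.
    int add(byte[] raw) {
        // Only the statistics path performs the (possibly lossy) decode.
        String s = new String(raw, StandardCharsets.UTF_8);
        if (minForStats == null || s.compareTo(minForStats) < 0) minForStats = s;
        if (maxForStats == null || s.compareTo(maxForStats) > 0) maxForStats = s;
        Integer existing = ids.get(raw);
        if (existing != null) return existing;
        int id = ids.size();
        ids.put(raw.clone(), id);
        return id;
    }

    int size() { return ids.size(); }
    String min() { return minForStats; }

    public static void main(String[] args) {
        ByteDictionarySketch dict = new ByteDictionarySketch();
        // Both values are invalid UTF-8 and would decode to the same U+FFFD String.
        int a = dict.add(new byte[] {(byte) 0xE9});
        int b = dict.add(new byte[] {(byte) 0xFF});
        System.out.println(a != b);      // true: the raw bytes stay distinct
        System.out.println(dict.size()); // 2
    }
}
```

A String-keyed dictionary would have merged the two entries in this example, since both decode to the same replacement character.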

TEST PLAN
  Run unit tests, including the new query. Before the fix, the join in the 
test query produces no results.
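
One plausible way such a join comes back empty is sketched below. This is an assumption about the mechanism, not the contents of orc_nonutf8.q: keys that passed through a String no longer match the original bytes. The class and helper names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo of a byte-exact equi-join between a source and a lossy copy.
final class JoinMismatchSketch {
    // Simulates a writer that converts the bytes to String and back.
    static byte[] viaString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    // Counts keys in `left` whose exact bytes also occur in `right`.
    static int joinCount(List<byte[]> left, List<byte[]> right) {
        Set<ByteBuffer> rightKeys = new HashSet<>();
        for (byte[] key : right) rightKeys.add(ByteBuffer.wrap(key));
        int matches = 0;
        for (byte[] key : left) {
            if (rightKeys.contains(ByteBuffer.wrap(key))) matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<byte[]> source = List.of(
            new byte[] {'a', (byte) 0xE9},
            new byte[] {'b', (byte) 0xFF});
        List<byte[]> lossyCopy = new ArrayList<>();
        for (byte[] key : source) lossyCopy.add(viaString(key));
        System.out.println(joinCount(source, source));    // 2
        System.out.println(joinCount(source, lossyCopy)); // 0
    }
}
```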

REVISION DETAIL
  https://reviews.facebook.net/D9501

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA


 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch





[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-03-18 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HIVE-4199:
--

Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch

sxyuan updated the revision HIVE-4199 [jira] ORC writer doesn't handle 
non-UTF8 encoded Text properly.

  Phabricator is converting the non-UTF8 data file to UTF8, defeating the 
purpose of the test. Trying a raw diff.
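
A quick way to confirm whether a data file still contains non-UTF-8 bytes after such a conversion is a strict decode that reports errors instead of replacing them. The sketch below uses only the JDK; the class and method names are illustrative and not part of the patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for checking test data; not part of Hive.
final class Utf8Check {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A lone 0xE9 is malformed; C3 A9 is the UTF-8 encoding of the same letter.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE9}));              // false
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xA9})); // true
    }
}
```

If a check like this returns true for the bytes of data/files/nonutf8.txt, the file has been converted and no longer exercises the bug.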

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=29967&id=29973#toc

AFFECTED FILES
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
  data/files/nonutf8.txt

To: kevinwilfong, sxyuan
Cc: JIRA


 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch





[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-03-18 Thread Phabricator (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phabricator updated HIVE-4199:
--

Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch

sxyuan updated the revision HIVE-4199 [jira] ORC writer doesn't handle 
non-UTF8 encoded Text properly.

  Making the new data file binary.

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=29973&id=30009#toc

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA


 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch





[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly

2013-03-18 Thread Samuel Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Samuel Yuan updated HIVE-4199:
--

Status: Patch Available  (was: Open)

 ORC writer doesn't handle non-UTF8 encoded Text properly
 

 Key: HIVE-4199
 URL: https://issues.apache.org/jira/browse/HIVE-4199
 Project: Hive
  Issue Type: Bug
  Components: Serializers/Deserializers
Reporter: Samuel Yuan
Assignee: Samuel Yuan
Priority: Minor
 Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch, 
 HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch


