[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-4199:
    Status: Open (was: Patch Available)

ORC writer doesn't handle non-UTF8 encoded Text properly
    Key: HIVE-4199
    URL: https://issues.apache.org/jira/browse/HIVE-4199
    Project: Hive
    Issue Type: Bug
    Components: Serializers/Deserializers
    Reporter: Samuel Yuan
    Assignee: Samuel Yuan
    Priority: Minor
    Attachments: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch, HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch, HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch, HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch

StringTreeWriter currently converts fields stored as Text objects into Strings. This can lose information (see http://en.wikipedia.org/wiki/Replacement_character#Replacement_character), and is also unnecessary since the dictionary stores Text objects.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
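The information loss described in the issue can be reproduced without Hive. The following is a minimal sketch, assuming plain byte arrays stand in for Hadoop Text objects so that the example has no Hadoop dependency. The class and method names (LossyRoundTrip, roundTrip) are hypothetical and do not come from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    // Decode the bytes as UTF-8 into a String, then encode the String back.
    // new String(byte[], Charset) replaces malformed input with U+FFFD
    // instead of failing, so the original bytes cannot be recovered.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ISO-8859-1 style data: the trailing byte is not valid UTF-8.
        byte[] first = {'c', 'a', 'f', (byte) 0xE9};
        byte[] second = {'c', 'a', 'f', (byte) 0xE8};

        // The round trip does not give the original bytes back.
        System.out.println(Arrays.equals(first, roundTrip(first)));

        // Two different inputs collapse to the same bytes after the round trip.
        System.out.println(Arrays.equals(roundTrip(first), roundTrip(second)));
    }
}
```

A value that has gone through such a conversion no longer compares equal, byte for byte, to the same value read from the original text file, which is consistent with the join in the test query returning no rows.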
[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-4199:
    Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.4.patch

sxyuan updated the revision "HIVE-4199 [jira] ORC writer doesn't handle non-UTF8 encoded Text properly".

Updated test case to clarify the expected behaviour.

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=30009&id=30675#toc

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA
[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-4199:
    Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.1.patch

sxyuan requested code review of "HIVE-4199 [jira] ORC writer doesn't handle non-UTF8 encoded Text properly".

Reviewers: kevinwilfong

StringTreeWriter currently converts fields stored as Text objects into Strings. This can lose information (see http://en.wikipedia.org/wiki/Replacement_character#Replacement_character), and is also unnecessary since the dictionary stores Text objects.

Instead, we can check whether Text or String is preferred and simply use the preferred class, converting only to String for the index stats.

TEST PLAN
  Run unit tests, including new query. The join in the test query originally produces no results because of the bug.

REVISION DETAIL
  https://reviews.facebook.net/D9501

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA
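The approach outlined in the review request (keep the value in its byte form and convert to String only for the index statistics) might look roughly like the sketch below. This is not the code from the patch: BytePreservingDictionary and all of its methods are hypothetical, plain byte arrays stand in for Text, and a linear search over a list replaces the red-black tree used by the real dictionary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BytePreservingDictionary {
    private final List<byte[]> entries = new ArrayList<>();
    private String min; // String form is kept only for min/max statistics
    private String max;

    /** Adds a value given as raw bytes and returns its dictionary position. */
    public int add(byte[] raw) {
        // The lossy decode is confined to the statistics.
        updateStats(new String(raw, StandardCharsets.UTF_8));
        for (int i = 0; i < entries.size(); i++) {
            if (Arrays.equals(entries.get(i), raw)) {
                return i; // already present
            }
        }
        entries.add(raw.clone()); // stored bytes are never re-encoded
        return entries.size() - 1;
    }

    /** Adds a value given as a String by encoding it as UTF-8. */
    public int add(String value) {
        return add(value.getBytes(StandardCharsets.UTF_8));
    }

    public byte[] get(int position) {
        return entries.get(position).clone();
    }

    public int size() {
        return entries.size();
    }

    public String getMin() {
        return min;
    }

    public String getMax() {
        return max;
    }

    private void updateStats(String s) {
        if (min == null || s.compareTo(min) < 0) {
            min = s;
        }
        if (max == null || s.compareTo(max) > 0) {
            max = s;
        }
    }

    public static void main(String[] args) {
        BytePreservingDictionary dict = new BytePreservingDictionary();
        int a = dict.add(new byte[] {'c', 'a', 'f', (byte) 0xE9});
        int b = dict.add(new byte[] {'c', 'a', 'f', (byte) 0xE8});
        // Distinct non-UTF8 values keep distinct dictionary entries.
        System.out.println(a != b);
        System.out.println(dict.size());
    }
}
```

In this sketch only the statistics see the replacement character; the dictionary entries that would be written out are the caller's original bytes.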
[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-4199:
    Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.2.patch

sxyuan updated the revision "HIVE-4199 [jira] ORC writer doesn't handle non-UTF8 encoded Text properly".

Phabricator is converting the non-UTF8 data file to UTF8, defeating the purpose of the test. Trying a raw diff.

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=29967&id=29973#toc

AFFECTED FILES
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
  data/files/nonutf8.txt

To: kevinwilfong, sxyuan
Cc: JIRA
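One way to confirm that a data file such as data/files/nonutf8.txt still contains non-UTF8 bytes after passing through a review tool is to decode it strictly. The sketch below assumes the file contents have already been read into a byte array; the Utf8Check class and its isValidUtf8 method are hypothetical helpers, not part of Hive or of the patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the whole array decodes as UTF-8 without errors.
    // REPORT makes the decoder throw instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] nonUtf8 = {'c', 'a', 'f', (byte) 0xE9};
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(nonUtf8));
        System.out.println(isValidUtf8(utf8));
    }
}
```

If a check like this reports the test file as valid UTF-8, the file has been converted somewhere along the way and no longer exercises the bug.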
[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-4199:
    Attachment: HIVE-4199.HIVE-4199.HIVE-4199.D9501.3.patch

sxyuan updated the revision "HIVE-4199 [jira] ORC writer doesn't handle non-UTF8 encoded Text properly".

Making the new data file binary.

Reviewers: kevinwilfong

REVISION DETAIL
  https://reviews.facebook.net/D9501

CHANGE SINCE LAST DIFF
  https://reviews.facebook.net/D9501?vs=29973&id=30009#toc

AFFECTED FILES
  data/files/nonutf8.txt
  ql/src/test/results/clientpositive/orc_nonutf8.q.out
  ql/src/test/queries/clientpositive/orc_nonutf8.q
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/StringRedBlackTree.java
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java

To: kevinwilfong, sxyuan
Cc: JIRA
[jira] [Updated] (HIVE-4199) ORC writer doesn't handle non-UTF8 encoded Text properly
[ https://issues.apache.org/jira/browse/HIVE-4199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samuel Yuan updated HIVE-4199:
    Status: Patch Available (was: Open)