[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973880#comment-13973880 ]

Jason Dere commented on HIVE-6843:
----------------------------------

The string literal does interpret \uD801 as a single character, and \uD801\uDC00 as a single code point (got the example character from http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html):

{noformat}
String str1 = "123\uD801\uDC00456";
System.out.println("str1 length=" + str1.length() + ", codePointCount=" + str1.codePointCount(0, str1.length()));

str1 length=8, codePointCount=7
{noformat}

So if we count things by Unicode code points, the 4 would be at index 4 (for 0-based index).

INSTR for UTF-8 returns incorrect position
------------------------------------------
Key: HIVE-6843
URL: https://issues.apache.org/jira/browse/HIVE-6843
Project: Hive
Issue Type: Bug
Components: UDF
Affects Versions: 0.11.0, 0.12.0
Reporter: Clif Kranish
Assignee: Szehon Ho
Priority: Minor
Attachments: HIVE-6843.patch

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973882#comment-13973882 ]

Jason Dere commented on HIVE-6843:
----------------------------------

From that link, \uD801\uDC00 would be the representation for U+10400.
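To illustrate the mapping mentioned in the comment above, here is a minimal sketch that uses only java.lang.Character to go from U+10400 to its UTF-16 surrogate pair and back. The class and method names are hypothetical and are not part of Hive or of the attached patch.

```java
// Illustrative sketch only: SurrogatePairDemo and utf16Escapes are hypothetical names.
final class SurrogatePairDemo {

    // Formats the UTF-16 code units of a code point as Java-style escapes.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Subtract 0x10000, then split the remaining 20 bits into two 10-bit halves:
        // high = 0xD800 + (0x400 >> 10) = 0xD801, low = 0xDC00 + (0x400 & 0x3FF) = 0xDC00.
        System.out.println(utf16Escapes(0x10400));

        // The reverse direction: combine the two surrogates into one code point.
        int cp = Character.toCodePoint('\uD801', '\uDC00');
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
    }
}
```

A single supplementary code point therefore occupies two Java chars, which is why String.length() and codePointCount() disagree for such strings.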
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974664#comment-13974664 ]

Jason Dere commented on HIVE-6843:
----------------------------------

I think it looks good, +1
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13974739#comment-13974739 ]

Hive QA commented on HIVE-6843:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12640922/HIVE-6843.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 5406 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/testReport
Console output: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12640922
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973566#comment-13973566 ]

Jason Dere commented on HIVE-6843:
----------------------------------

Should this also work for Unicode characters which require more than one Java character? If you add these checks to TestGenericUDFUtils, the 2nd check fails:

{code}
Assert.assertEquals(3, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("\uD801\uDC00"), 0));
Assert.assertEquals(4, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("4"), 0));
{code}

This would require using String.codePointCount() on the indexOf() result.
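The suggestion above (applying String.codePointCount() to the indexOf() result) can be sketched as follows. This is a stand-alone illustration on plain java.lang.String, not the code in HIVE-6843.patch; it does not use Hive's Text or GenericUDFUtils, and the class and method names are hypothetical.

```java
// Illustrative sketch only: CodePointIndex and codePointIndexOf are hypothetical names.
final class CodePointIndex {

    // Returns the 0-based index of the first occurrence of needle in haystack,
    // counted in Unicode code points, or -1 if needle does not occur.
    static int codePointIndexOf(String haystack, String needle) {
        int unitIndex = haystack.indexOf(needle); // counted in UTF-16 code units
        if (unitIndex < 0) {
            return -1;
        }
        // Count the code points that precede the match.
        return haystack.codePointCount(0, unitIndex);
    }

    public static void main(String[] args) {
        String s = "123\uD801\uDC00456";
        // indexOf counts the surrogate pair as two units, so "4" is at unit index 5.
        System.out.println("units=" + s.indexOf("4")
                + " codePoints=" + codePointIndexOf(s, "4")); // units=5 codePoints=4
        System.out.println(codePointIndexOf(s, "\uD801\uDC00")); // 3
    }
}
```

With this conversion both expectations in the checks above (3 and 4) hold for the plain-String case.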
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13973730#comment-13973730 ]

Szehon Ho commented on HIVE-6843:
---------------------------------

Thanks for the review. As I understand it, you are passing a string literal to the Text constructor, so it is not interpreting \uD801 as one char, and there are actually 6 chars there: '\', 'u', 'D', '8', '0', '1'. I tried the following test and it seemed to work:

char[] chararray = new char[] {'1', '2', '3', '\uD801', '\uDC00', '4', '5', '6'};
String str = new String(chararray);
Assert.assertEquals(5, GenericUDFUtils.findText(new Text(str), new Text("4"), 0));

I guess the second check was supposed to be 5, not 4.
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961957#comment-13961957 ]

Clif Kranish commented on HIVE-6843:
------------------------------------

Sorry, copy/paste got me. They _look_ the same. And sorry about the curly quotes, I don't know where they came from. The real issue is that for UTF-8, INSTR returns the position in bytes instead of characters. So this returns a 9 where by my count it should be a 5. Thank you for your support.

select INSTR ('НАСТРОЕние', 'Р') from
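The 9-versus-5 discrepancy reported above can be reproduced outside Hive. The sketch below assumes plain java.lang.String in place of Hive's Text and uses hypothetical class and method names; positions are 1-based with 0 meaning "not found", mirroring how INSTR is described in this thread. The Cyrillic string is written with Unicode escapes so the source stays ASCII-only.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: ByteVsCharPosition and its methods are hypothetical names.
final class ByteVsCharPosition {

    // 1-based position of needle counted in UTF-8 bytes, or 0 if absent.
    static int bytePosition(String haystack, String needle) {
        int i = haystack.indexOf(needle);
        if (i < 0) {
            return 0;
        }
        return haystack.substring(0, i).getBytes(StandardCharsets.UTF_8).length + 1;
    }

    // 1-based position of needle counted in Unicode code points, or 0 if absent.
    static int charPosition(String haystack, String needle) {
        int i = haystack.indexOf(needle);
        if (i < 0) {
            return 0;
        }
        return haystack.codePointCount(0, i) + 1;
    }

    public static void main(String[] args) {
        // The ten-letter Cyrillic string from the report, and Cyrillic capital ER (U+0420).
        String haystack = "\u041D\u0410\u0421\u0422\u0420\u041E\u0415\u043D\u0438\u0435";
        String needle = "\u0420";
        // Each of the four preceding letters takes two bytes in UTF-8: 4 * 2 + 1 = 9.
        System.out.println(bytePosition(haystack, needle)); // 9
        System.out.println(charPosition(haystack, needle)); // 5
    }
}
```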
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962583#comment-13962583 ]

Hive QA commented on HIVE-6843:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12639052/HIVE-6843.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 5549 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12639052
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961551#comment-13961551 ]

Szehon Ho commented on HIVE-6843:
---------------------------------

Hi, I was going to look at this, but when I tried it, it looks like your first P is a Cyrillic P (d0,a0), while the second is an English P (50). Can you verify? If you make the second a Cyrillic P, then it works.
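One way to check the byte values quoted above is to dump the UTF-8 encoding of each letter. This is a small sketch with hypothetical names, independent of Hive; the Cyrillic letters are written as Unicode escapes (U+0420 is Cyrillic capital ER, which looks like a Latin P).

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: HomoglyphBytes and utf8Hex are hypothetical names.
final class HomoglyphBytes {

    // Returns the UTF-8 bytes of s as comma-separated lowercase hex, e.g. "d0,a0".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u0420")); // Cyrillic ER: d0,a0
        System.out.println(utf8Hex("P"));      // Latin P: 50

        // Searching the Cyrillic word for the Latin letter finds nothing.
        String word = "\u041D\u0410\u0421\u0422\u0420\u041E\u0415\u043D\u0438\u0435";
        System.out.println(word.indexOf("P")); // -1
    }
}
```

Because the two letters are different code points, a search for the Latin one legitimately finds no match, which is a separate matter from the byte-versus-character position problem.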
[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position
[ https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960354#comment-13960354 ]

Clif Kranish commented on HIVE-6843:
------------------------------------

Using the INSTR function to find the position of a substring in a UTF-8 string returns zero:

select INSTR (‘НАСТРОЕние’, ‘P’) from foo-bar