[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-18 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973880#comment-13973880
 ] 

Jason Dere commented on HIVE-6843:
--

The string literal does interpret \uD801 as a single character, and 
\uD801\uDC00 as a single code point (got the example character from 
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html):

{noformat}
String str1 = "123\uD801\uDC00456";
System.out.println("str1 length=" + str1.length() + ", codePointCount=" + str1.codePointCount(0, str1.length()));

str1 length=8, codePointCount=7
{noformat}

So if we count by Unicode code points, the 4 would be at index 4 (using a 
0-based index).


 INSTR for UTF-8 returns incorrect position
 --

 Key: HIVE-6843
 URL: https://issues.apache.org/jira/browse/HIVE-6843
 Project: Hive
  Issue Type: Bug
  Components: UDF
Affects Versions: 0.11.0, 0.12.0
Reporter: Clif Kranish
Assignee: Szehon Ho
Priority: Minor
 Attachments: HIVE-6843.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-18 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973882#comment-13973882
 ] 

Jason Dere commented on HIVE-6843:
--

From that link, \uD801\uDC00 is the UTF-16 surrogate-pair representation of U+10400.
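
For reference, a minimal sketch of how the pair can be checked with the standard `Character` API. The class and method names (`SurrogatePairDemo`, `toSurrogates`) are made up for illustration and are not part of Hive:

```java
public class SurrogatePairDemo {
    /** Splits a code point into its UTF-16 representation (two chars for supplementary code points). */
    static char[] toSurrogates(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10400);
        // Print the high and low surrogate as four-digit hex values.
        System.out.println(String.format("U+10400 -> %04X %04X", (int) pair[0], (int) pair[1]));
        // Recombine the pair to confirm the round trip.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x10400);
    }
}
```

This should print `U+10400 -> D801 DC00` followed by `true`.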



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-18 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13974664#comment-13974664
 ] 

Jason Dere commented on HIVE-6843:
--

I think it looks good, +1

 INSTR for UTF-8 returns incorrect position
 --

 Key: HIVE-6843
 URL: https://issues.apache.org/jira/browse/HIVE-6843
 Project: Hive
  Issue Type: Bug
  Components: UDF
Affects Versions: 0.11.0, 0.12.0
Reporter: Clif Kranish
Assignee: Szehon Ho
Priority: Minor
 Attachments: HIVE-6843.2.patch, HIVE-6843.patch








[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13974739#comment-13974739
 ] 

Hive QA commented on HIVE-6843:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12640922/HIVE-6843.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 5406 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/testReport
Console output: http://bigtop01.cloudera.org:8080/job/precommit-hive/25/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12640922



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-17 Thread Jason Dere (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973566#comment-13973566
 ] 

Jason Dere commented on HIVE-6843:
--

Should this also work for Unicode characters that require more than one Java 
char? If you add these checks to TestGenericUDFUtils, the second check fails:
{code}
Assert.assertEquals(3, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("\uD801\uDC00"), 0));
Assert.assertEquals(4, GenericUDFUtils.findText(new Text("123\uD801\uDC00456"), new Text("4"), 0));
{code}

This would require using String.codePointCount() on the indexOf() result.
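
A minimal sketch of that idea on plain Java strings, assuming a hypothetical helper named `indexOfCodePoints`. It is not Hive's `GenericUDFUtils.findText`, which works on `Text` objects, and it only illustrates converting a char index into a code point index:

```java
public class CodePointIndex {
    /**
     * Hypothetical helper (not part of Hive): returns the 0-based code point
     * index of the first occurrence of needle in haystack, or -1 if absent.
     */
    static int indexOfCodePoints(String haystack, String needle) {
        int charIndex = haystack.indexOf(needle); // index counted in UTF-16 chars
        if (charIndex < 0) {
            return -1;
        }
        // Count code points before the match, so a surrogate pair counts once.
        return haystack.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String s = "123\uD801\uDC00456";
        System.out.println(s.indexOf("4"));                        // char index: 5
        System.out.println(indexOfCodePoints(s, "4"));             // code point index: 4
        System.out.println(indexOfCodePoints(s, "\uD801\uDC00"));  // code point index: 3
    }
}
```

With this conversion both expected values in the checks above (3 and 4) hold for a plain `String`.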



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-17 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973730#comment-13973730
 ] 

Szehon Ho commented on HIVE-6843:
-

Thanks for the review.  As I understand it, you are passing a string literal to 
the Text constructor, so it is not interpreting \uD801 as one char; there are 
actually 6 chars there: '\', 'u', 'D', '8', '0', '1'.

I tried the following test and it seemed to work:

char[] chararray = new char[] {'1', '2', '3', '\uD801', '\uDC00', '4', '5', '6'};
String str = new String(chararray);
Assert.assertEquals(5, GenericUDFUtils.findText(new Text(str), new Text("4"), 0));

I guess the second check was supposed to be 5, not 4.



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-07 Thread Clif Kranish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961957#comment-13961957
 ] 

Clif Kranish commented on HIVE-6843:


Sorry, copy/paste got me. They _look_ the same. And sorry about the curly 
quotes, I don't know where they came from. 

The real issue is that for UTF-8, INSTR returns the position in bytes instead of 
characters. So this returns 9 where, by my count, it should return 5. Thank you 
for your support. 

select INSTR ('НАСТРОЕние', 'Р') from  
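
To illustrate the byte versus character count, here is a small sketch in plain Java. It assumes INSTR-style 1-based positions with 0 for "not found". The class and method names are invented for the example and do not reflect Hive's implementation:

```java
import java.nio.charset.StandardCharsets;

public class BytePositionDemo {
    /** 1-based position counted in code points; 0 if needle is absent. */
    static int charPosition(String haystack, String needle) {
        int idx = haystack.indexOf(needle);
        return idx < 0 ? 0 : haystack.codePointCount(0, idx) + 1;
    }

    /** 1-based position counted in UTF-8 bytes; 0 if needle is absent. */
    static int bytePosition(String haystack, String needle) {
        int idx = haystack.indexOf(needle);
        if (idx < 0) {
            return 0;
        }
        // Encode everything before the match and count the bytes.
        return haystack.substring(0, idx).getBytes(StandardCharsets.UTF_8).length + 1;
    }

    public static void main(String[] args) {
        // The word from the report, written with Unicode escapes to avoid
        // source-encoding problems; the needle is Cyrillic capital Er (U+0420).
        String word = "\u041D\u0410\u0421\u0422\u0420\u041E\u0415\u043D\u0438\u0435";
        String er = "\u0420";
        System.out.println(charPosition(word, er)); // 5
        System.out.println(bytePosition(word, er)); // 9
    }
}
```

Each Cyrillic letter here takes two bytes in UTF-8, so the four letters before the match account for eight bytes, which is where the 9 comes from.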



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-07 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13962583#comment-13962583
 ] 

Hive QA commented on HIVE-6843:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12639052/HIVE-6843.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 5549 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/2170/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12639052



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-06 Thread Szehon Ho (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13961551#comment-13961551
 ] 

Szehon Ho commented on HIVE-6843:
-

Hi, I was going to look at this, but when I tried it, it looks like your 
first P is a Cyrillic P (d0,a0), while the second is an English P (50).  Can you 
verify?  If you make the second a Cyrillic P, then it works.
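
One way to check which character was pasted is to dump its UTF-8 bytes. A minimal sketch, with a made-up class name (`LookalikeDemo`) and helper (`hexBytes`):

```java
import java.nio.charset.StandardCharsets;

public class LookalikeDemo {
    /** Returns the UTF-8 bytes of s as space-separated lowercase hex. */
    static String hexBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cyrillicEr = "\u0420"; // Cyrillic capital Er, looks like Latin P
        String latinP = "P";
        System.out.println(hexBytes(cyrillicEr));         // d0 a0
        System.out.println(hexBytes(latinP));             // 50
        System.out.println(cyrillicEr.equals(latinP));    // false
    }
}
```

The two letters render alike but are different code points, so a search for one does not match the other.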



[jira] [Commented] (HIVE-6843) INSTR for UTF-8 returns incorrect position

2014-04-04 Thread Clif Kranish (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13960354#comment-13960354
 ] 

Clif Kranish commented on HIVE-6843:


Using the INSTR function to find the position of a substring in a UTF-8 
string returns zero.

select INSTR (‘НАСТРОЕние’, ‘P’) from foo-bar 

