[jira] Commented: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED

Mamta A. Satoor (JIRA) Thu, 04 Oct 2007 13:26:22 -0700

    [ 
https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532518
 ]


Mamta A. Satoor commented on DERBY-2967:
----------------------------------------

I spent some time on Unicode specification and SQL specification in regards to 
this Jira entry and DERBY-3080. SQL specification, Section 3.1.1.1 says this 
for character "This is identical to the Unicode definition of abstract 
character. In ISO/IEC 9075, when the relevant character repertoire is UCS, a 
character can be thought of as that which is represented by one code point." 
The Unicode standard specifies a numeric value (code point) and a name for each 
of its characters. 

So, the question is, in say Norwegian, what do we call "AA"? Is it a character 
or something else? Unicode specificaiton has a concept of text elemenets and 
characters (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf Unicode 
chapter 2 Section 2.1 subtopic "Text Elements, Characters, and Text Processes". 
Text elements are units in a text and there are several kinds of text elements, 
some of which are grapheme clusters("user-perceived characters"), words, 
sentences etc. Characters are used to represent each of these different types 
of text elements. Grapheme clusters are what user perceives as a single 
character but they may or maynot be single characters underneath. For eg, "ch" 
in Slovakian is perceived by user as a single character (ie a grapheme cluster) 
but it is composed of 2 characters "c" and "h" as 2. Another eg would be "AA" 
in Norwegian. Unicode treats "AA" as a grapheme cluster which is composed of 2 
characters "A" and "A". (Unicode chapter 2 Figure 2.1 and 
http://unicode.org/reports/tr29/ Section 1). 

So, coming to our question of should "AA" be treated as 2 characters in like 
but as one character in = operation? 
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf chapter 2 Section 2.1 
subtopic "Text Elements, Characters, and Text Processes" also talks about how 
something can be a text element for one kind of text processing but not for 
another kind of text processing. I think what we are discussing here is that 
for Norwegian, we want to treat "AA" as grapheme cluster when we are using an = 
operator eg "AA" = 'Å' but "AA" is not a grapheme cluster when it is used in a 
LIKE operation eg 'AA' LIKE 'Å'. In other words, we want to use a 
CollationElementIterator on entire character string when we are dealing with = 
operator. But when working with LIKE operator, we want to generate a 
CollationElementIterator for one character at a time rather than one a grapheme 
cluster. This logic goes with what SQL spec wants us to do for = and LIKE. 

Quoting SQL spec for = Section 8.2 <comparison predicate> General Rules 3d) 
"Depending on the collation, two strings may compare as equal even if they are 
of different lengths or contain different sequences of characters". 

Quoting SQL spec fo LIKE General Rules 3cii)
1) A substring of MCV(the string in which we are trying to find a match) is a 
sequence of 0 (zero) or more contiguous characters of MCV and each character of 
MCV is part of exactly one substring.
2) If the i-th substring of PCV(the pattern that we are looking for in MCV) is 
an arbitrary character specifier, then the i-th substring of MCV is any single 
character.
3) If the i-th substring of PCV is an arbitrary string specifier, then the i-th 
substring of MCV is any sequence of 0 (zero) or more characters.
4) If the i-th substring of PCV is a single character specifier, then the i-th 
substring of MCV contains exactly 1 (one) character that is equal to the 
character represented by the single character specifier according to the 
collation of the <like predicate>.
5) The number of substrings of MCV is equal to the number of substring 
specifiers of PCV.

Based on above, I think the like method for territory based character string 
types has to deal with one character at a time in pattern and in value string. 
If the character in pattern is not a metacharacter, then we should convert it 
into it's collation element(s) and compare it with the collation elements(s) of 
the next character in value string. If the character in pattern is _, then we 
should eat next element in value string. If the character in pattern is escape 
character, then we should look at next character in pattern and convert it into 
it's collation element(s) and compare it with the collation element(s) of the 
next character in value string. Finally, if the character in pattern is %, then 
we should eat 0 to n character in value string until we find a match for rest 
of the pattern characters or we find a mismatch following the algorithm in this 
paragraph.


> Single character does not match high value unicode character with collation 
> TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2967
>                 URL: https://issues.apache.org/jira/browse/DERBY-2967
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.4.0.0
>            Reporter: Kathey Marsden
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, 
> DERBY2967_offset_based_stat_Oct02_07.txt, fullcoll.out, 
> patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt, 
> patch2_with_setOffset_stat_Sep2007.txt, step1_iteratorbased_Sep1507_diff.txt, 
> step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt, temp_stat.txt, 
> TestFrench.java, TestNorway.java
>
>
> With TERRITORY_BASED collation '_' does not match  the character \uFA2D.  It 
> is the same for english or norwegian. FOR collation UCS_BASIC it matches 
> fine.  Could you tell me if this is a bug?
> Here is a program to reproduce.
> import java.sql.*;
> public class HighCharacter {
>    public static void main(String args[]) throws Exception
>    {
>    System.out.println("\n Territory no_NO");
>    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
>    Connection conn = 
> DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Territory en_US");
>    conn = 
> DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Collation USC_BASIC");
>    conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
>    testLikeWithHighestValidCharacter(conn);
>    }
> public static  void testLikeWithHighestValidCharacter(Connection conn) throws 
> SQLException {
>    Statement stmt = conn.createStatement();
>    try {
>    stmt.executeUpdate("drop table t1");
>    }catch (SQLException se)
>    {// drop failure ok.
>    }
>    stmt.executeUpdate("create table t1(c11 int)");
>    stmt.executeUpdate("insert into t1 values 1");
>  
>    // \uFA2D - the highest valid character according to
>    // Character.isDefined() of JDK 1.4;
>    PreparedStatement ps =
>    conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
>      String[] match = { "%", "_", "\uFA2D" };
>    for (int i = 0; i < match.length; i++) {
>    System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
>    ps.setString(1, match[i]);
>    ResultSet rs = ps.executeQuery();
>    if( rs.next() && rs.getString(1).equals("1"))
>        System.out.println("PASS");
>    else          System.out.println("FAIL: no match");
>    rs.close();
>    }
>   }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED

Reply via email to