[ https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532518 ]
Mamta A. Satoor commented on DERBY-2967: ---------------------------------------- I spent some time on Unicode specification and SQL specification in regards to this Jira entry and DERBY-3080. SQL specification, Section 3.1.1.1 says this for character "This is identical to the Unicode definition of abstract character. In ISO/IEC 9075, when the relevant character repertoire is UCS, a character can be thought of as that which is represented by one code point." The Unicode standard specifies a numeric value (code point) and a name for each of its characters. So, the question is, in say Norwegian, what do we call "AA"? Is it a character or something else? Unicode specificaiton has a concept of text elemenets and characters (http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf Unicode chapter 2 Section 2.1 subtopic "Text Elements, Characters, and Text Processes". Text elements are units in a text and there are several kinds of text elements, some of which are grapheme clusters("user-perceived characters"), words, sentences etc. Characters are used to represent each of these different types of text elements. Grapheme clusters are what user perceives as a single character but they may or maynot be single characters underneath. For eg, "ch" in Slovakian is perceived by user as a single character (ie a grapheme cluster) but it is composed of 2 characters "c" and "h" as 2. Another eg would be "AA" in Norwegian. Unicode treats "AA" as a grapheme cluster which is composed of 2 characters "A" and "A". (Unicode chapter 2 Figure 2.1 and http://unicode.org/reports/tr29/ Section 1). So, coming to our question of should "AA" be treated as 2 characters in like but as one character in = operation? http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf chapter 2 Section 2.1 subtopic "Text Elements, Characters, and Text Processes" also talks about how something can be a text element for one kind of text processing but not for another kind of text processing. I think what we are discussing here is that for Norwegian, we want to treat "AA" as grapheme cluster when we are using an = operator eg "AA" = 'Å' but "AA" is not a grapheme cluster when it is used in a LIKE operation eg 'AA' LIKE 'Å'. In other words, we want to use a CollationElementIterator on entire character string when we are dealing with = operator. But when working with LIKE operator, we want to generate a CollationElementIterator for one character at a time rather than one a grapheme cluster. This logic goes with what SQL spec wants us to do for = and LIKE. Quoting SQL spec for = Section 8.2 <comparison predicate> General Rules 3d) "Depending on the collation, two strings may compare as equal even if they are of different lengths or contain different sequences of characters". Quoting SQL spec fo LIKE General Rules 3cii) 1) A substring of MCV(the string in which we are trying to find a match) is a sequence of 0 (zero) or more contiguous characters of MCV and each character of MCV is part of exactly one substring. 2) If the i-th substring of PCV(the pattern that we are looking for in MCV) is an arbitrary character specifier, then the i-th substring of MCV is any single character. 3) If the i-th substring of PCV is an arbitrary string specifier, then the i-th substring of MCV is any sequence of 0 (zero) or more characters. 4) If the i-th substring of PCV is a single character specifier, then the i-th substring of MCV contains exactly 1 (one) character that is equal to the character represented by the single character specifier according to the collation of the <like predicate>. 5) The number of substrings of MCV is equal to the number of substring specifiers of PCV. Based on above, I think the like method for territory based character string types has to deal with one character at a time in pattern and in value string. If the character in pattern is not a metacharacter, then we should convert it into it's collation element(s) and compare it with the collation elements(s) of the next character in value string. If the character in pattern is _, then we should eat next element in value string. If the character in pattern is escape character, then we should look at next character in pattern and convert it into it's collation element(s) and compare it with the collation element(s) of the next character in value string. Finally, if the character in pattern is %, then we should eat 0 to n character in value string until we find a match for rest of the pattern characters or we find a mismatch following the algorithm in this paragraph. > Single character does not match high value unicode character with collation > TERRITORY_BASED > ------------------------------------------------------------------------------------------- > > Key: DERBY-2967 > URL: https://issues.apache.org/jira/browse/DERBY-2967 > Project: Derby > Issue Type: Bug > Components: SQL > Affects Versions: 10.4.0.0 > Reporter: Kathey Marsden > Assignee: Mamta A. Satoor > Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, > DERBY2967_offset_based_stat_Oct02_07.txt, fullcoll.out, > patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt, > patch2_with_setOffset_stat_Sep2007.txt, step1_iteratorbased_Sep1507_diff.txt, > step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt, temp_stat.txt, > TestFrench.java, TestNorway.java > > > With TERRITORY_BASED collation '_' does not match the character \uFA2D. It > is the same for english or norwegian. FOR collation UCS_BASIC it matches > fine. Could you tell me if this is a bug? > Here is a program to reproduce. > import java.sql.*; > public class HighCharacter { > public static void main(String args[]) throws Exception > { > System.out.println("\n Territory no_NO"); > Class.forName("org.apache.derby.jdbc.EmbeddedDriver"); > Connection conn = > DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED"); > testLikeWithHighestValidCharacter(conn); > conn.close(); > System.out.println("\n Territory en_US"); > conn = > DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED"); > testLikeWithHighestValidCharacter(conn); > conn.close(); > System.out.println("\n Collation USC_BASIC"); > conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true"); > testLikeWithHighestValidCharacter(conn); > } > public static void testLikeWithHighestValidCharacter(Connection conn) throws > SQLException { > Statement stmt = conn.createStatement(); > try { > stmt.executeUpdate("drop table t1"); > }catch (SQLException se) > {// drop failure ok. > } > stmt.executeUpdate("create table t1(c11 int)"); > stmt.executeUpdate("insert into t1 values 1"); > > // \uFA2D - the highest valid character according to > // Character.isDefined() of JDK 1.4; > PreparedStatement ps = > conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?"); > String[] match = { "%", "_", "\uFA2D" }; > for (int i = 0; i < match.length; i++) { > System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]); > ps.setString(1, match[i]); > ResultSet rs = ps.executeQuery(); > if( rs.next() && rs.getString(1).equals("1")) > System.out.println("PASS"); > else System.out.println("FAIL: no match"); > rs.close(); > } > } > } > Mamta made some comments on this issue in the following thread: > http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.