Re: Collation feature discussion

Daniel John Debrunner Thu, 29 Mar 2007 13:56:04 -0800

Mamta Satoor wrote:

In Derby, there will be two character sets which will have identical character repertoire (UCS) but they may have different collation associated with them depending on the value of JDBC url attribute COLLATION. The 2 character sets will be
1) USER character set - collation of UCS_BASIC/TERRITORY_BASED depending on the value of jdbc url attribute COLLATION specified at create database time.
2) SQL_IDENTIFIER character set - collation of UCS_BASIC.
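
To make the difference between the two collations concrete, here is a minimal Java sketch. It is not Derby code, and the class and method names are hypothetical. It assumes UCS_BASIC orders strings by code point (which String.compareTo matches for strings without supplementary characters) and uses java.text.Collator as a stand-in for a territory-based collation.

```java
import java.text.Collator;
import java.util.Locale;

// Illustrative only: not Derby code. All names here are hypothetical.
class CollationDemo {
    // Assumption: UCS_BASIC compares by Unicode code point. For strings
    // without supplementary characters, String.compareTo gives that order.
    static int ucsBasicCompare(String a, String b) {
        return a.compareTo(b);
    }

    // Assumption: a territory-based collation behaves like the JDK
    // Collator for the given locale.
    static int territoryCompare(String a, String b, Locale territory) {
        return Collator.getInstance(territory).compare(a, b);
    }

    public static void main(String[] args) {
        // 'B' (U+0042) precedes 'a' (U+0061) in code point order.
        System.out.println(ucsBasicCompare("a", "B") > 0);
        // The English collator orders "a" before "B".
        System.out.println(territoryCompare("a", "B", Locale.ENGLISH) < 0);
    }
}
```

Both comparisons print true: the same pair of strings sorts in opposite order under the two collations, which is why it matters which character set a given expression is associated with.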

In general I think this looks good, thanks for the work on this Mamta and a nice summary. I think there are some finer points to be decided but the basic design is good. These finer points are about how a collation is derived for certain expressions; that doesn't change the overall design, just the input to the collation decision making. E.g. all of this could be coded and then changing the collation for string literals would not affect the rest of the design.

As per SQL spec, Section 11.1 <schema definition>, there is an optional syntax to associate a character set for a schema at create schema time. Syntax Rule 5 says that if a character set is not specified by the user, then the character set associated with schema is implementation defined. In Derby 10.3, system schemas will be associated with SQL_IDENTIFIER character set and all the user schemas will be associated with USER character set. Further on, General Rule 3 specifies that the character set associated with schema is used as the default character set for all <column definitions>. Based on this, all the user character columns will pick up the collation associated with USER character set and all the system character columns will pick up the collation associated with SQL_IDENTIFIER character set.

The character set specification for string literals is not as well defined as for <column definitions> but my proposal here will work within SQL spec boundaries. SQL spec Section 5.3 <literal>, Syntax Rule 14b says that if the character set is not specified for character string literal, then character string literal's character set will be the character set of the SQL-client module. Derby does not implement SQL-client module, but definition of SQL-client module in Section 13.1 says that SQL-client module definition has mandatory <module name clause> which is defined in Section 13.2 <module name clause>. The Syntax Rule 4 in this section says that if a character set is not specified for the SQL-client module, then its character set is implementation-defined. I think we can use this implementation-defined character set for a SQL-client module to our advantage. We can define Derby's implementation-defined character set for SQL-client module as current schema's character set and hence the current schema's character set will become string literal's character set.

Interesting. The thing that jumped out at me is that this effectively means that the character set for the SQL-client module depends on the session's state (its current schema). That just seems strange.
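
The dependence on session state can be sketched as a small Java model of the proposal. This is not Derby code: the class, enum and method names are hypothetical, and the schema names SYS and SYSIBM merely stand in for the system schemas.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative model of the proposal; not Derby code. Names are hypothetical.
class LiteralCollationSketch {
    enum Collation { UCS_BASIC, TERRITORY_BASED }

    // Assumption: these names stand in for Derby's system schemas.
    static final List<String> SYSTEM_SCHEMAS = Arrays.asList("SYS", "SYSIBM");

    // System schemas use the SQL_IDENTIFIER character set (always UCS_BASIC);
    // user schemas use the USER character set, whose collation was chosen
    // when the database was created.
    static Collation schemaCollation(String schema, Collation userCollation) {
        return SYSTEM_SCHEMAS.contains(schema) ? Collation.UCS_BASIC : userCollation;
    }

    // Under the proposal, a string literal takes the collation of whatever
    // schema is current when the statement is compiled.
    static Collation literalCollation(String currentSchema, Collation userCollation) {
        return schemaCollation(currentSchema, userCollation);
    }

    public static void main(String[] args) {
        Collation user = Collation.TERRITORY_BASED;
        // The same literal resolves differently as the current schema changes.
        System.out.println(literalCollation("APP", user));
        System.out.println(literalCollation("SYS", user));
    }
}
```

In a database created with a territory-based collation, the same literal gets TERRITORY_BASED while the current schema is a user schema and UCS_BASIC after switching to a system schema, which is the session-state dependence noted above.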

I came across a couple more pieces of information from the SQL Spec that may or may not help. :-)


4.37.3

"An SQL-session has a default character set name" that is implementation defined.


18.7

"Set the default character set name for <character string literal>s in <preparable statement>s that are prepared in the current SQL-session ..."

Not sure that helps, but does allow us to pick a single default character set for a session (though it couldn't change based upon current schema).


Then the SQL session also has a collation, 4.37.3 again:

"For each character set known to the SQL-implementation, an SQL-session has at most one SQL-session collation for that character set, to be used when the rules of Subclause 9.13, “Collation determination”, are applied. There are no SQL-session collations at the start of an SQL-session."

but that there are "no SQL-session collations at the start" means that it's pointless without a <set session collation statement>, i.e. 9.13 SR3c) does not apply to Derby.

3) <character string type> (SQL spec section 6.1 <data type> Syntax Rule 3b and 16) - Rule 3b says that collation type of character string type is the character set's collation AND rule 16 says that if <character string type> is not contained in a <column definition>, then an implementation-defined character set is associated with the <character string type>. We can define Derby's implementation-defined character set for such <character string type> to be current schema's character set. The collation derivation will be implicit.

Does 3) cover JDBC parameters as well (i.e. ?), where the type of the parameter is a character type?

6) CHAR, VARCHAR functions do not look like they are defined in the SQL spec. But based on 5) [TRIM etc.] above, the result character string type's collation can be considered same as the first argument's collation type if the first argument to CHAR/VARCHAR function is a character string type. If the first argument is not character string type, then the result character string of CHAR/VARCHAR will have the same collation as current schema's character set. The collation derivation will be implicit.


This approach means that CHAR(varchar_col, 20) behaves differently to CAST (varchar_col AS CHAR(20)). Not sure if that's good or bad, but they might be implemented today using the same code path.
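
A rough Java sketch of how the two expressions could diverge, reading rule 6) for CHAR and rule 3) for the CAST target type. This is my interpretation of the proposal rather than Derby code, and all names are hypothetical.

```java
// Illustrative reading of rules 3) and 6); not Derby code. Names are hypothetical.
class CharVersusCastSketch {
    enum Collation { UCS_BASIC, TERRITORY_BASED }

    // Rule 6): CHAR/VARCHAR take the first argument's collation when it is a
    // character string type, else the current schema's collation.
    // A null firstArgCollation models a non-character first argument.
    static Collation charFunction(Collation firstArgCollation, Collation currentSchemaCollation) {
        return firstArgCollation != null ? firstArgCollation : currentSchemaCollation;
    }

    // Rule 3): the CAST target is a <character string type> outside a
    // <column definition>, so it takes the current schema's collation and
    // ignores the operand's collation.
    static Collation castToChar(Collation operandCollation, Collation currentSchemaCollation) {
        return currentSchemaCollation;
    }

    public static void main(String[] args) {
        // varchar_col lives in a system table (UCS_BASIC); the current
        // schema is a user schema in a TERRITORY_BASED database.
        Collation column = Collation.UCS_BASIC;
        Collation current = Collation.TERRITORY_BASED;
        System.out.println(charFunction(column, current)); // CHAR(varchar_col, 20)
        System.out.println(castToChar(column, current));   // CAST(varchar_col AS CHAR(20))
    }
}
```

With a system-table column and a user current schema, CHAR keeps the column's UCS_BASIC collation while CAST yields TERRITORY_BASED, so the two would only agree when the column's collation happens to match the current schema's.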


Dan.
