>>>>> "Peter" == Peter Eisentraut <pete...@gmx.net> writes:

 > On Tuesday 14 April 2009 07:07:27 Andrew Gierth wrote:
 >> FWIW, the SQL spec puts the onus of normalization squarely on the
 >> application; the database is allowed to assume that Unicode
 >> strings are already normalized, is allowed to behave in
 >> implementation-defined ways when presented with strings that
 >> aren't normalized, and provision of normalization functions and
 >> predicates is just another optional feature.

 Peter> Can you name chapter and verse on that?

4.2.8 Universal character sets

  A UCS string is a character string whose character repertoire is UCS
  and whose character encoding form is one of UTF8, UTF16, or
  UTF32. Any two UCS strings are comparable.

  An SQL-implementation may assume that all UCS strings are normalized
  in one of Normalization Form C (NFC), Normalization Form D (NFD),
  Normalization Form KC (NFKC), or Normalization Form KD (NFKD), as
  specified by [Unicode15]. <normalized predicate> may be used to
  verify the normalization form to which a particular UCS string
  conforms. Applications may also use <normalize function> to enforce
  a particular <normal form>. With the exception of <normalize function>
  and <normalized predicate>, the result of any operation on an
  unnormalized UCS string is implementation-defined.

  Conversion of UCS strings from one character set to another is
  automatic.

  Detection of a noncharacter in a UCS-string causes an exception
  condition to be raised. The detection of an unassigned code point
  does not.

[Obviously there are things here that we don't conform to anyway (we
don't raise exceptions for noncharacters, for example). We don't claim
conformance to T061.]
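
[For illustration only, a minimal Python sketch of the noncharacter
check the spec text above describes. The helper names is_noncharacter
and first_noncharacter are hypothetical; the ranges are the 66
noncharacters Unicode reserves (U+FDD0..U+FDEF plus the last two code
points of each plane). Unassigned code points are deliberately not
flagged, matching the spec wording.]

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode reserves as noncharacters."""
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def first_noncharacter(s: str) -> int:
    """Index of the first noncharacter in s, or -1 if there is none."""
    for i, ch in enumerate(s):
        if is_noncharacter(ord(ch)):
            return i
    return -1
```

A conforming implementation would raise an exception where this sketch
returns a non-negative index.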

<normalized predicate> ::=
  <row value predicand> <normalized predicate part 2>
<normalized predicate part 2> ::=
  IS [ NOT ] [ <normal form> ] NORMALIZED

1) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalized predicate>.

2) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.

<normalize function> ::=
  NORMALIZE <left paren> <character value expression>
      [ <comma> <normal form> [ <comma> <normalize function result length> ] ]
      <right paren>

<normal form> ::=
    NFC
  | NFD
  | NFKC
  | NFKD

7) Without Feature T061, "UCS support", conforming SQL language shall
   not contain a <normalize function>.

9) Without Feature F394, "Optional normal form specification",
   conforming SQL language shall not contain <normal form>.
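
[As a rough analogue of the two spec features quoted above, here is how
the same operations look in Python's standard unicodedata module. This
is only an illustration of what <normalize function> and <normalized
predicate> compute, not of any SQL implementation; is_normalized
requires Python 3.8 or later.]

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (a decomposed "é").
decomposed = "e\u0301"

# Roughly what NORMALIZE(s, NFC) would produce: the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Roughly what "s IS NFC NORMALIZED" would test.
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)
composed_is_nfc = unicodedata.is_normalized("NFC", composed)
```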

 Peter> I see this, for example,

 Peter> 6.27 <numeric value function>
 [...]
 Peter> So SQL redirects the question of character length to the Unicode
 Peter> standard.  I have not been able to find anything there on a
 Peter> quick look, but I'm sure the Unicode standard has some very
 Peter> specific ideas on this.  Note that the matter of normalization
 Peter> is not mentioned here.

I've taken a not-so-quick look at the Unicode standard (though I don't
claim to be any sort of expert on it), and I can't see any definitive
indication of what the length is supposed to be; however, the use of
terminology such as "combining character sequence" (meaning a series
of codepoints that combine to make a single glyph) strongly implies
that our interpretation is correct and that the OP's is not.
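
To make the two candidate units concrete, here is a small Python
sketch that counts both. The function names are hypothetical, and the
sequence count is only an approximation: it treats every code point
whose general category starts with "M" as attaching to the preceding
character, and ignores refinements such as ZWJ/ZWNJ.

```python
import unicodedata


def code_point_length(s: str) -> int:
    """Length in code points (Python's len() already counts these)."""
    return len(s)


def combining_sequence_count(s: str) -> int:
    """Approximate number of combining character sequences in s."""
    def is_mark(ch: str) -> bool:
        return unicodedata.category(ch).startswith("M")

    # Every non-mark code point starts a new sequence.
    count = sum(1 for ch in s if not is_mark(ch))
    # Leading marks with no base character form one (defective) sequence.
    if s and is_mark(s[0]):
        count += 1
    return count
```

For "e" followed by U+0301, the first function gives 2 and the second
gives 1.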

Other indications: the units used by length() must be the same as the
units used by position() and substring() (in the spec, when USING
CHARACTERS is specified), and it would not make sense to use a
definition of "character" that did not allow you to look inside a
combining sequence.
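
A short Python sketch of why the units have to agree. Python is used
here only as an illustration; its len(), str.find() and slicing stand
in for length(), position() and substring(), and unlike SQL its
positions are 0-based.

```python
# "café au lait" with the é stored as 'e' + U+0301 COMBINING ACUTE ACCENT.
s = "cafe\u0301 au lait"

length = len(s)               # counts code points, including the accent
accent_at = s.find("\u0301")  # position of the combining accent itself
base_only = s[:accent_at]     # substring that stops inside the sequence
```

Because all three operations count code points, the accent can be
located and the string can be split between the base letter and its
combining mark. With a per-sequence unit, the accent would have no
position of its own.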

I've also failed so far to find any examples of other programming
languages in which a combining character sequence is taken to be a
single character for purposes of length or position specification.

-- 
Andrew (irc:RhodiumToad)
