Re: AL32UTF8

2004-04-30 Thread Lincoln A. Baxter
On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> Am I right in thinking that perl's internal utf8 representation
> represents surrogates as a single (4 byte) code point and not as
> two separate code points?
> 
> This is the form that Oracle call AL32UTF8.
> 
> What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> byte string that used surrogates? Would there be problems?
> (For example, a string returned from Oracle when using the UTF8
> character set instead of the newer AL32UTF8 one.)
> 
I think it makes no difference (at least I could not find one), except
for the internal storage.  Several of the tests I wrote print a SQL
DUMP(nch), and you can see the difference in the internal storage in
those prints.  The strings come back to the client the way they were put in.

I have tested this with 4 databases:

dbcharset/ncharset
------------------
us7ascii/utf8
us7ascii/al16utf16
utf8/utf8
utf8/al16utf16

All tests produce the same results with all databases using both .UTF8
and .AL32UTF8 in NLS_LANG.
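
For illustration only (this is not one of the actual tests; the connect
string, the nch_test table, and its nch column are made up, and NLS_LANG
is assumed to already name a UTF8/AL32UTF8 client charset), a minimal
DBI round trip of that kind looks roughly like this:

    #!/usr/bin/perl
    # Hedged sketch, not a real DBD::Oracle test: insert a value into an
    # NVARCHAR2 column and print Oracle's DUMP() of it so the server-side
    # storage is visible alongside the round-tripped value.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:Oracle:orcl', 'scott', 'tiger',
                           { RaiseError => 1, AutoCommit => 1 });

    my $val = "\x{263A}";    # WHITE SMILING FACE (fits in the BMP)
    $dbh->do('INSERT INTO nch_test (nch) VALUES (:1)', undef, $val);

    my ($dump, $back) = $dbh->selectrow_array(
        'SELECT DUMP(nch, 1016), nch FROM nch_test');
    print "server storage: $dump\n";        # hex bytes plus charset name
    print "round trip ok\n" if $back eq $val;

    $dbh->disconnect;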

Lincoln




Re: AL32UTF8

2004-04-30 Thread Larry Wall
On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
: Tim Bunce wrote:
: 
: > Am I right in thinking that perl's internal utf8 representation
: > represents surrogates as a single (4 byte) code point and not as
: > two separate code points?
: 
: Mmmh.  Right and wrong... as a single code point, yes, since the real
: UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
: 3 bytes.

No, Tim's right--they're four bytes.  It's only the individual
surrogates that would come out to three bytes.  The break between
three and four bytes is between \x{ffff} and \x{10000}.
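
For concreteness, here is a small stand-alone Perl sketch, not from
this thread or from DBD::Oracle, showing the byte count on each side
of that boundary:

    #!/usr/bin/perl
    # Minimal sketch: U+FFFF is the last code point that fits in three
    # UTF-8 bytes; U+10000, the first supplementary character, needs four.
    use strict;
    use warnings;

    for my $cp (0xFFFF, 0x10000) {
        my $str = chr($cp);
        utf8::encode($str);    # turn the string into raw UTF-8 bytes in place
        printf "U+%05X -> %d bytes: %s\n", $cp, length($str),
            join ' ', map { sprintf '%02X', ord } split //, $str;
    }
    # Prints:
    # U+0FFFF -> 3 bytes: EF BF BF
    # U+10000 -> 4 bytes: F0 90 80 80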

Larry


Re: AL32UTF8

2004-04-30 Thread Jarkko Hietaniemi
> 
> Okay. Thanks.
> 
> Basically I need to document that Oracle "AL32UTF8" should be used
> as the client charset in preference to the older "UTF8" because
> "UTF8" doesn't do the "best"? thing with surrogate pairs.

"because what Oracle calls UTF8 is not conformant with the modern
definition of UTF8"

> Seems like "best" is the, er, best word to use here as "right"
> would be too strong. But then the "shortest form" requirement
> is quite strong so perhaps "modern standard" would be the right words.
> 
> Tim.


-- 
Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen


Re: AL32UTF8

2004-04-30 Thread Tim Bunce
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of
the heavy lifting, I've mostly been setting goals and directions
at the DBI API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
> Tim Bunce wrote:
> 
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> 
> Mmmh.  Right and wrong... as a single code point, yes, since the real
> UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
> 3 bytes.
> 
> > This is the form that Oracle call AL32UTF8.
> 
> Does this
> 
> http://www.unicode.org/reports/tr26/
> 
> look like Oracle's older (?) UTF8?

"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value."

Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.

(I presume it dates from before UTF16 had surrogate pairs. When
they were added to UTF16 they gave a name "CESU-8" to what old UTF16
to UTF8 conversion code would produce when given surrogate pairs.
A classic standards maneuver :)
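
To make the six-byte vs four-byte difference concrete, here is a rough
Perl sketch (an illustration only, not Oracle or DBD::Oracle code) that
builds the CESU-8 form of U+10000 by hand from its UTF-16 surrogate
pair and compares it with the real UTF-8 form:

    #!/usr/bin/perl
    # Rough sketch of the CESU-8 transformation quoted above: encode each
    # UTF-16 surrogate code unit as a three-byte sequence instead of
    # encoding the code point itself as four bytes.
    use strict;
    use warnings;

    sub surrogate_triplet {    # three-byte UTF-8-style form of one UTF-16 code unit
        my ($cu) = @_;
        return ( 0xE0 |  ($cu >> 12),
                 0x80 | (($cu >>  6) & 0x3F),
                 0x80 |  ($cu        & 0x3F) );
    }

    my $cp   = 0x10000;                             # first supplementary character
    my $high = 0xD800 + (($cp - 0x10000) >> 10);    # high surrogate
    my $low  = 0xDC00 + (($cp - 0x10000) & 0x3FF);  # low surrogate

    my @cesu8 = ( surrogate_triplet($high), surrogate_triplet($low) );

    my $utf8 = chr($cp);
    utf8::encode($utf8);                            # real UTF-8: four bytes

    printf "CESU-8: %s\n", join ' ', map { sprintf '%02X', $_  } @cesu8;
    printf "UTF-8 : %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
    # CESU-8: ED A0 80 ED B0 80
    # UTF-8 : F0 90 80 80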

> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
> 
> You would get out the surrogate code points from the sv, not the
> supplementary plane code point the surrogate pairs are encoding.
> Depends what you do with the data: this might be okay, might not.
> Since it's valid UTF-8, nothing should croak perl-side.

Okay. Thanks.

Basically I need to document that Oracle "AL32UTF8" should be used
as the client charset in preference to the older "UTF8" because
"UTF8" doesn't do the "best"? thing with surrogate pairs.

Seems like "best" is the, er, best word to use here as "right"
would be too strong. But then the "shortest form" requirement
is quite strong so perhaps "modern standard" would be the right words.

Tim.


Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > Am I right in thinking that perl's internal utf8 representation
> > represents surrogates as a single (4 byte) code point and not as
> > two separate code points?
> > 
> > This is the form that Oracle call AL32UTF8.
> > 
> > What would be the effect of setting SvUTF8_on(sv) on a valid utf8
> > byte string that used surrogates? Would there be problems?
> > (For example, a string returned from Oracle when using the UTF8
> > character set instead of the newer AL32UTF8 one.)
>
> I think it makes no difference (at least I could not find one), except
> for the internal storage.  Several of the tests I wrote print a SQL
> DUMP(nch), and you can see the difference in the internal storage in
> those prints.  The strings come back to the client the way they were put in.
> 
> I have tested this with 4 databases:
> 
> dbcharset/ncharset
> ------------------
> us7ascii/utf8
> us7ascii/al16utf16
> utf8/utf8
> utf8/al16utf16
> 
> All tests produce the same results with all databases using both .UTF8
> and .AL32UTF8 in NLS_LANG.

Were you using characters that require surrogates in UTF16?
If not, then you wouldn't see a difference between .UTF8 and .AL32UTF8.

Here's a relevant quote from the Oracle 9.2 docs at
http://www.dbis.informatik.uni-goettingen.de/Teaching/oracle-doc/server.920/a96529/ch6.htm#1005295

"You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
applications. If you do not need supplementary characters, then it
does not matter whether you choose UTF8 or AL32UTF8. However, if
your OCI applications might handle supplementary characters, then
you need to make a decision. Because UTF8 can require up to three
bytes for each character, one supplementary character is represented
in two code points, totalling six bytes. In AL32UTF8, one supplementary
character is represented in one code point, totalling four bytes."

So the key question is... can we just do SvUTF8_on(sv) on either
of these kinds of Oracle UTF8 encodings? Seems like the answer is
yes, from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.
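
A quick pure-Perl way to see what that means for the data (an
illustration only, not DBD::Oracle code) is to flag each byte form of
U+10000 as UTF-8 and look at what characters come out:

    #!/usr/bin/perl
    # Illustration: the CESU-8 form yields the two surrogate code points,
    # as Jarkko said; the AL32UTF8 form yields the single supplementary
    # character.  utf8::decode is used here as a pure-Perl stand-in for
    # "validate, then SvUTF8_on".
    use strict;
    use warnings;

    my $cesu8 = "\xED\xA0\x80\xED\xB0\x80";   # U+10000 as Oracle "UTF8" (CESU-8)
    my $utf8  = "\xF0\x90\x80\x80";           # U+10000 as AL32UTF8 (real UTF-8)

    utf8::decode($cesu8);
    utf8::decode($utf8);

    printf "CESU-8 form: %d char(s): %s\n", length($cesu8),
        join ' ', map { sprintf 'U+%04X', ord } split //, $cesu8;
    printf "UTF-8  form: %d char(s): %s\n", length($utf8),
        join ' ', map { sprintf 'U+%04X', ord } split //, $utf8;
    # CESU-8 form: 2 char(s): U+D800 U+DC00
    # UTF-8  form: 1 char(s): U+10000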

Tim.

p.s. If we do opt for defaulting NLS_NCHAR (effectively) when NLS_LANG
and NLS_NCHAR are not defined, then we should use AL32UTF8 if possible.


Re: AL32UTF8

2004-04-30 Thread Martin Hosken
Dear Tim,

"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value."
Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.
(I presume it dates from before UTF16 had surrogate pairs. When
they were added to UTF16 they gave a name "CESU-8" to what old UTF16
to UTF8 conversion code would produce when given surrogate pairs.
A classic standards maneuver :)
IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
Unicode) because they were storing higher plane codes using the 
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
char of 4+ bytes. There is no real trouble doing it that way since 
anyone can convert between the 'wrong' UTF-8 and the correct form. But 
they found that if you do a simple binary based sort of a string in 
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
different order. On this basis they made a request to the UTC to have 
AL32UTF8 added as a kludge, and out of the kindness of their hearts the 
UTC agreed, thus saving Oracle from a whole heap of work. But all are 
agreed that UTF-8 and not AL32UTF8 is the way forward.

Yours,
Martin


Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
> > 
> > Okay. Thanks.
> > 
> > Basically I need to document that Oracle "AL32UTF8" should be used
> > as the client charset in preference to the older "UTF8" because
> > "UTF8" doesn't do the "best"? thing with surrogate pairs.
> 
> "because what Oracle calls UTF8 is not conformant with the modern
> definition of UTF8"

Thanks Jarkko.

Tim.

> > Seems like "best" is the, er, best word to use here as "right"
> > would be too strong. But then the "shortest form" requirement
> > is quite strong so perhaps "modern standard" would be the right words.
> > 
> > Tim.
> 
> 
> -- 
> Jarkko Hietaniemi <[EMAIL PROTECTED]> http://www.iki.fi/jhi/ "There is this special
> biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen


Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote:
> Dear Tim,
> 
> >"CESU-8 defines an encoding scheme for Unicode identical to UTF-8
> >except for its representation of supplementary characters. In CESU-8,
> >supplementary characters are represented as six-byte sequences
> >resulting from the transformation of each UTF-16 surrogate code
> >unit into an eight-bit form similar to the UTF-8 transformation, but
> >without first converting the input surrogate pairs to a scalar value."
> >
> >Yes, that sounds like it.  But see my quote from Oracle docs in my
> >reply to Lincoln's email to make sure.
> >
> >(I presume it dates from before UTF16 had surrogate pairs. When
> >they were added to UTF16 they gave a name "CESU-8" to what old UTF16
> >to UTF8 conversion code would produce when given surrogate pairs.
> >A classic standards maneuver :)
> 
> IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
> Unicode) because they were storing higher plane codes using the 
> surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
> 2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
> char of 4+ bytes. There is no real trouble doing it that way since 
> anyone can convert between the 'wrong' UTF-8 and the correct form. But 
> they found that if you do a simple binary based sort of a string in 
> AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
> different order. On this basis they made a request to the UTC to have 
> AL32UTF8 added as a kludge, and out of the kindness of their hearts the 
> UTC agreed, thus saving Oracle from a whole heap of work. But all are 
> agreed that UTF-8 and not AL32UTF8 is the way forward.

Um, now you've confused me.

The Oracle docs say "In AL32UTF8, one supplementary character is
represented in one code point, totalling four bytes." which you
say is the "correct UTF-8 way". So the old Oracle ``UTF8'' charset
is what's now called "CESU-8" and what Oracle call ``AL32UTF8''
is the "correct UTF-8 way".

So did you mean CESU-8 when you said AL32UTF8?

Tim.