On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
> On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> > On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> > > On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > > > Am I right in thinking that perl's internal utf8 representation
> > > > represents surrogates as a single (4 byte) code point and not as
> > > > two separate code points?
> > > > 
> > > > This is the form that Oracle call AL32UTF8.
> > > > 
[snip]
> > 
> > Were you using characters that require surrogates in UTF16?
> > If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
> 
> Hmmm...err.. probably not... I guess I need to hunt one up.
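
For reference, the difference Tim is asking about only shows up for characters outside the BMP. A minimal Python sketch, assuming Oracle's UTF8 charset behaves like CESU-8 (each UTF-16 surrogate encoded as its own 3-byte sequence) and AL32UTF8 like standard UTF-8; `cesu8_encode` is a hypothetical helper written for this illustration, not part of any Oracle or DBI API:

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text CESU-8 style: every UTF-16 code unit is encoded on its own."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"  # outside the BMP, so UTF-16 needs a surrogate pair
print(ch.encode("utf-8").hex(" "))   # f0 90 90 80 (one 4-byte sequence)
print(cesu8_encode(ch).hex(" "))     # ed a0 81 ed b0 80 (two 3-byte sequences)
```

For BMP-only test data both encodings give identical bytes, which is why such data cannot distinguish .UTF8 from .AL32UTF8.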

There is only one case in which 3 and 4 byte characters can be round
tripped.  After a bunch of other changes and fixups, I tested with the
following two new totally invented (by me) super wide characters:

row:   8: nice_string=\x{32263A}   byte_string=248|140|162|152|186     (3 byte wide char)
row:   9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
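
The byte strings above are consistent with the original 1-to-6-byte UTF-8 scheme applied to code points far above U+10FFFF (so "3 byte" and "4 byte" appear to describe the width of the code point, not of its encoding, which is 5 and 6 bytes). A small Python sketch that reproduces them; `extended_utf8` is a hypothetical helper for illustration, and it is an assumption that perl's internal encoding follows this scheme for such values:

```python
def extended_utf8(cp: int) -> bytes:
    """Encode a code point with the original 1-to-6-byte UTF-8 scheme."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if cp < limit:
            # Continuation bytes carry 6 bits each, most significant group first.
            tail = [0x80 | ((cp >> (6 * i)) & 0x3F) for i in range(ncont - 1, -1, -1)]
            return bytes([lead | (cp >> (6 * ncont))] + tail)
    raise ValueError("code point needs more than 6 bytes")


for cp in (0x32263A, 0x2532263A):
    print(hex(cp), "|".join(str(b) for b in extended_utf8(cp)))
# 0x32263a 248|140|162|152|186
# 0x2532263a 252|165|140|162|152|186
```

Neither value is a legal Unicode code point, so a strict UTF-8 validator would be expected to reject both sequences.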

In a database with ncharset=al16utf16, storage is as follows (NLS_NCHAR=UTF8 or AL32UTF8):

        row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253 
        row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253 
        
        Values can NOT be round tripped.

In a database with Ncharset=utf8, storage is as follows (NLS_NCHAR=AL32UTF8):

        row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
        row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191
        
        Values can NOT be round tripped.
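
One plausible reading of the failing dumps (an inference from the byte values, not something confirmed in this thread): every byte of the original 5- or 6-byte sequence was rejected and stored as U+FFFD REPLACEMENT CHARACTER, which is 255,253 in big-endian UTF-16 and 239,191,189 in UTF-8. A short Python sketch of that arithmetic; `replaced` is a hypothetical helper:

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replaced(nbytes: int, encoding: str) -> list:
    """Bytes stored if each of nbytes input bytes became one U+FFFD."""
    return list((REPLACEMENT * nbytes).encode(encoding))


# 5 and 6 are the lengths of the two test byte strings (rows 8 and 9).
for nbytes in (5, 6):
    print(nbytes, len(replaced(nbytes, "utf-16-be")), len(replaced(nbytes, "utf-8")))
# 5 10 15
# 6 12 18
```

Those lengths match the reported Len=10/12 and Len=15/18, and the original code points cannot be recovered from a run of replacement characters, which would explain why the values do not round trip.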

In a database with Ncharset=utf8 and NLS_NCHAR=UTF8, storage is as follows:

        row 8: nch=Typ=1 Len=5: 248,140,162,152,186
        row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186
        
        Values CAN be round tripped!
        
So it would appear that UTF8 is the PREFERRED database NCHARSET, not AL16UTF16,
and that NLS_NCHAR=UTF8 is more portable than NLS_NCHAR=AL32UTF8.

[snip]
> Seems reasonable.  I think you made a good point about the cost of
> crawling through the data. I'm convinced. If you have not already
> changed it, I will. 
> 
> > p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> > and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.
> 
> I changed that last night (to use AL32UTF8).

But given the above results... perhaps I should change it back.

Lincoln

