On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
> On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
> > On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
> > > On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
> > > > Am I right in thinking that perl's internal utf8 representation
> > > > represents surrogates as a single (4 byte) code point and not as
> > > > two separate code points?
> > > >
> > > > This is the form that Oracle call AL32UTF8.
> > > > [snip]
> >
> > Were you using characters that require surrogates in UTF16?
> > If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
>
> Hmmm...err.. probably not... I guess I need to hunt one up.

There is only one case in which 3 and 4 byte characters can be round
tripped. After a bunch of other changes and fixups, I tested with the
following two new totally invented (by me) super wide characters:

  row: 8: nice_string=\x{32263A}
          byte_string=248|140|162|152|186      (3 byte wide char)
  row: 9: nice_string=\x{2532263A}
          byte_string=252|165|140|162|152|186  (4 byte wide char)

In a database with ncharset=al16utf16, storage is as follows
(NLS_NCHAR= UTF8 or AL32UTF8):

  row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
  row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253

Values can NOT be round tripped.

In a database with Ncharset=utf8 storage is as follows
(NLS_NCHAR=AL32UTF8):

  row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
  row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191

Values can NOT be round tripped.

In a database with Ncharset=utf8 and NLS_NCHAR=AL32UTF8 storage is as
follows:

  row 8: nch=Typ=1 Len=5: 248,140,162,152,186
  row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186

Values CAN be round tripped!

So, it would appear that UTF8 is the PREFERRED Database NCHARSET, not
AL16UTF16, and that NLS_NCHAR=UTF8 is more portable than
NLS_NCHAR=AL32UTF8.

[snip]
> Seems reasonable. I think you made a good point about the cost of
> crawling through the data. I'm convinced. If you have not already
> changed it, I will.
>
> > p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
> > and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.
>
> I changed that last night (to use AL32UTF8).

But given the above results... perhaps I should change it back.

Lincoln
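
As a cross-check on the byte values quoted above, here is a minimal Python sketch. It is not part of the original test script, and encode_legacy_utf8 is a hypothetical helper written only for illustration. It assumes the two invented characters were encoded with the original 5- and 6-byte UTF-8 forms, which later revisions of the standard no longer permit. It also shows that 239,191,189 and 255,253 are the UTF-8 and UTF-16BE encodings of U+FFFD, the Unicode replacement character. The stored lengths are consistent with each source byte having been replaced by one U+FFFD, but that is an inference from the numbers, not something confirmed against Oracle.

```python
def encode_legacy_utf8(cp):
    """Encode a code point with the original UTF-8 scheme (up to 6 bytes).

    Hypothetical helper for illustration; modern UTF-8 stops at 4 bytes
    (U+10FFFF), so Python's own codecs cannot produce these sequences.
    """
    if cp < 0x80:
        return [cp]
    # (exclusive upper bound, lead-byte marker, number of continuation bytes)
    forms = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, ncont in forms:
        if cp < limit:
            out = [lead | (cp >> (6 * ncont))]
            for i in range(ncont - 1, -1, -1):
                out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
            return out
    raise ValueError("code point out of range for 6-byte UTF-8")


def replacement_bytes(encoding):
    """Bytes of U+FFFD (the replacement character) in the given encoding."""
    return list("\ufffd".encode(encoding))


if __name__ == "__main__":
    print("row 8:", encode_legacy_utf8(0x32263A))
    print("row 9:", encode_legacy_utf8(0x2532263A))
    print("U+FFFD utf-8:   ", replacement_bytes("utf-8"))
    print("U+FFFD utf-16be:", replacement_bytes("utf-16-be"))
```

Read this way, the Len=10 and Len=15 values for row 8 would be five U+FFFD characters (one per original byte) in UTF-16 and UTF-8 respectively, and Len=12 and Len=18 for row 9 would be six.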