On Tue, Mar 23, 2004 at 10:37:14PM -0000, Andy Hassall wrote:
> Tim Bunce wrote:
>
> > I'm also not too worried about the client-side character set. I
> > figure we should ask for anything that's unicode on the server-side
> > to be given to us as unicode and let perl deal with converting the
> > unicode to whatever encoding the application is using.
>
> Shouldn't this be the other way around, at least for DBD::Oracle - it's
> _all_ about the client character set.
>
> As far as fetching goes, it doesn't matter what the server character set
> is so long as your client character set is equal to or a superset of it.
> Unicode's a superset of pretty much everything, so having your NLS_LANG
> set to .UTF8 means you won't be losing data.
Sure. I was addressing the case when NLS_LANG is *not* utf8.

But basically I'd forgotten about the NLS_NCHAR env var you mention
below. That lets DBD::Oracle off the hook, in the sense that we can just
tell people to set that to utf8 if they want NCHAR types as utf8.

> For binding, you should make sure you only bind characters in the
> intersection of the client and database character sets, else it'll be a
> lossy conversion; but the binding's always done in the client character
> set (at least at the moment - see next bit).
>
> This is what I think of as 'UTF-8 support in DBD::Oracle' - am I on the
> right track here?:
>
> ===
>
> 1. If your NLS_LANG's character set is anything other than UTF8:
>
>    1a. All data fetched is sent to Perl unaltered, as fetched by OCI,
>    in the client character set (it may have been recoded by Oracle, but
>    that's transparent).
>
>    1b. If a Perl string with the utf8 flag is bound to a statement, it
>    is bound as UTF8 rather than the client character set. Otherwise it
>    is bound as normal (in the client character set).
>
> 2. If your NLS_LANG is set to .UTF8:
>
>    2a. All data fetched comes back with the Perl utf8 flag set, as it
>    is known to be valid UTF8 since Oracle converts it (if necessary; it
>    may have originally been Unicode on the server, but that's
>    transparent from the client side).
>
>    2b. All data bound is bound as UTF8, whether it has the Perl utf8
>    flag or not.
>
> ===
>
> (The national character set only affects the above if you have
> NLS_NCHAR set, i.e. your client national character set differs from
> your main client character set.)
>
> Apart from 1b, DBD::Oracle appears to be doing most of this, at least
> in the last svn revision I tried? "1b" /should/ just be a matter of
> setting OCI_ATTR_CHARSET_ID appropriately on the bind handle if the
> Perl utf8 flag is set.
>
> I think this ties in with Tim's points on the other character set
> thread:
> > 3. I don't really want the DBI to be involved in any recoding
> >    of character sets (from client charset to server charset),
> >    and I suggest that the drivers don't try to do that either.
> >
> > 5. When selecting data from the database the driver should:
> >    - return strings which have a unicode character set as UTF8.
> >    - return strings with other character sets as-is (unchanged), on
> >      the presumption that the application knows what to do with them.
>
> Sounds right to me. I don't think it should be trying to turn other
> Unicode encodings into UTF-8, as I think I read in one of the other
> mails; if you have NLS_LANG set to .UTF16 (or whatever the full code
> is), then you should get UTF-16 strings without the Perl utf8 flag set.
> Only NLS_LANG=.UTF8 should result in utf8-flagged Perl strings being
> returned, as that's the only encoding Perl really supports.

Sounds good. People living in a .UTF16 world can set NLS_LANG=.UTF8 in
their perl scripts before connecting.

> > 8. When passing data to the database (including the SQL statement),
> >    the driver should (perhaps) warn if it's presented with UTF8
> >    strings but the database can't handle unicode.
>
> Whether Oracle can handle being sent Unicode is not an all-or-nothing
> thing; this is where it depends on the database character set, and so
> back to ora_can_unicode.
>
> The question that ora_can_unicode answers is "Can I send ANY Unicode
> character and be confident it will be stored without corruption?". The
> original test failure was because it was trying to store a character
> that wasn't representable in the target database.
>
> A more refined question would be (clearly optional to utf8 support, but
> it seems a useful support function):
>
> "Given the current combination of the client character set, and whether
> the utf8 flag is set on the Perl string, can I store this value without
> data loss, either in the database charset, or the national charset, or
> both?"
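A rough client-side approximation of that check can be sketched with
core Encode. This is purely illustrative - can_store_losslessly is a
hypothetical helper, not a DBD::Oracle API, and Oracle charset names
like WE8ISO8859P1 would first need mapping to names Encode recognises:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: can $string be encoded into $encoding without
# loss? encode() with FB_CROAK dies on any unrepresentable character.
sub can_store_losslessly {
    my ($string, $encoding) = @_;
    return eval { encode($encoding, $string, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print can_store_losslessly("\x{20ac}", 'iso-8859-15') ? "ok" : "lossy", "\n"; # ok
print can_store_losslessly("\x{20ac}", 'iso-8859-1')  ? "ok" : "lossy", "\n"; # lossy
```

The real thing would presumably ask Oracle itself, as you suggest below.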
> $dbh->ora_can_store_string($string), perhaps? Bitmask return value as
> per ora_can_unicode?
>
> There are OCI functions that can answer this, e.g.
> OCINlsCharSetConvert() followed by
> OCICharSetConversionIsReplacementUsed().

Patches most welcome :)

> The Euro symbol is a good example for this question, since it's either
> not present, or in completely different places, in the most popular
> character sets.
>
> e.g. Binding "\x{20ac}" should be fine so long as your database is in
> UTF8, one of the other Unicode sets, or WE8ISO8859P15 or WE8MSWIN1252.
> But not if it's WE8ISO8859P1 (Latin-1) - that doesn't have a Euro
> symbol at all.
>
> If you try to bind its single-byte equivalent, chr(128) or chr(164), it
> depends on your client character set as well as the database character
> set.
>
> Hope I'm making some sense :-)

I think so. Thanks!

Tim.
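P.S. Andy's Euro example is easy to verify without a database, using
only core Encode (the Oracle names in the comments are just the
corresponding client character sets):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Where the Euro lives (or doesn't) in the single-byte sets above:
my $from_cp1252 = decode('cp1252',      chr 128);  # 0x80 in WE8MSWIN1252  is the Euro
my $from_latin9 = decode('iso-8859-15', chr 164);  # 0xA4 in WE8ISO8859P15 is the Euro
my $from_latin1 = decode('iso-8859-1',  chr 164);  # 0xA4 in WE8ISO8859P1  is the currency sign

printf "U+%04X U+%04X U+%04X\n", map ord, $from_cp1252, $from_latin9, $from_latin1;
# U+20AC U+20AC U+00A4
```

Which is exactly why binding chr(128) or chr(164) depends on the client
character set: the same byte names three different characters.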