On Tue, Mar 23, 2004 at 10:37:14PM -0000, Andy Hassall wrote:
> Tim Bunce wrote:
>
> > I'm also not too worried about the client-side character set. I
> > figure we should ask for anything that's unicode on the server-side
> > to be given to us as unicode and let perl deal with converting the
> > unicode to whatever encoding the application is using.
>
> Shouldn't this be the other way around, at least for DBD::Oracle - it's
> _all_ about the client character set.
>
> As far as fetching goes, it doesn't matter what the server character set
> is so long as your client character set is equal to or a superset of it.
> Unicode's a superset of pretty much everything, so having your NLS_LANG
> set to .UTF8 means you won't be losing data.
Sure. I was addressing the case when NLS_LANG is *not* utf8.

But basically I'd forgotten about the NLS_NCHAR env var you mention
below. That lets DBD::Oracle off the hook, in the sense that we can just
tell people to set that to utf8 if they want NCHAR types as utf8.

> For binding, you should make sure you only bind characters in the
> intersection of the client and database character sets, else it'll be a
> lossy conversion; but the binding's always done in the client character
> set (at least at the moment - see next bit).
>
> This is what I think of as 'UTF-8 support in DBD::Oracle' - am I on the
> right track here?:
>
> ===
>
> 1. If your NLS_LANG's character set is anything other than UTF8:
>
>    1a. All data fetched is sent to Perl unaltered, as fetched by OCI,
>    in the client character set (it may have been recoded by Oracle, but
>    that's transparent).
>
>    1b. If a Perl string with the utf8 flag is bound to a statement, it
>    is bound as UTF8 rather than the client character set. Otherwise it
>    is bound as normal (in the client character set).
>
> 2. If your NLS_LANG is set to .UTF8:
>
>    2a. All data fetched comes back with the Perl utf8 flag set, as it
>    is known to be valid UTF8 since Oracle converts it (if necessary; it
>    may have originally been Unicode on the server, but that's
>    transparent from the client side).
>
>    2b. All data bound is bound as UTF8, whether it has the Perl utf8
>    flag or not.
>
> ===
>
> (The national character set only affects the above if you have
> NLS_NCHAR set, i.e. your client national character set differs from
> your main client character set.)
>
> Apart from 1b, DBD::Oracle appears to be doing most of this, at least
> in the last svn revision I tried? "1b" /should/ just be a matter of
> setting OCI_ATTR_CHARSET_ID appropriately on the bind handle if the
> Perl utf8 flag is set.
>
> I think this ties in with Tim's points on the other character set
> thread:
> > 3. I don't really want the DBI to be involved in any recoding
> >    of character sets (from client charset to server charset),
> >    and I suggest that the drivers don't try to do that either.
> >
> > 5. When selecting data from the database the driver should:
> >    - return strings which have a unicode character set as UTF8.
> >    - return strings with other character sets as-is (unchanged), on
> >      the presumption that the application knows what to do with them.
>
> Sounds right to me. I don't think it should be trying to turn other
> Unicode encodings into UTF-8, as I think I read in one of the other
> mails; if you have NLS_LANG set to .UTF16 (or whatever the full code
> is), then you should get UTF-16 strings without the Perl utf8 flag set.
> Only NLS_LANG=.UTF8 should result in utf8-flagged Perl strings being
> returned, as that's the only encoding Perl really supports.

Sounds good. People living in a .UTF16 world can set NLS_LANG=.UTF8 in
their perl scripts before connecting.

> > 8. When passing data to the database (including the SQL statement),
> >    the driver should (perhaps) warn if it's presented with UTF8
> >    strings but the database can't handle unicode.
>
> Whether Oracle can handle being sent Unicode is not an all-or-nothing
> thing; this is where it depends on the database character set, and so
> back to ora_can_unicode.
>
> The question that ora_can_unicode answers is "Can I send ANY Unicode
> character and be confident it will be stored without corruption?". The
> original test failure was because it was trying to store a character
> that wasn't representable in the target database.
>
> A more refined question would be (clearly optional to utf8 support, but
> it seems a useful support function):
>
> "Given the current combination of the client character set, and whether
> the utf8 flag is set on the Perl string, can I store this value without
> data loss, either in the database charset, or the national charset, or
> both?"
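A rough client-side approximation of that check can be sketched with
core Encode. This is purely illustrative - can_store_losslessly is a
hypothetical helper, not a DBD::Oracle API, and Oracle charset names
like WE8ISO8859P1 would first need mapping to names Encode recognises:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: can $string be encoded into $encoding without
# loss? encode() with FB_CROAK dies on any unrepresentable character.
sub can_store_losslessly {
    my ($string, $encoding) = @_;
    return eval { encode($encoding, $string, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print can_store_losslessly("\x{20ac}", 'iso-8859-15') ? "ok" : "lossy", "\n"; # ok
print can_store_losslessly("\x{20ac}", 'iso-8859-1')  ? "ok" : "lossy", "\n"; # lossy
```

The real thing would presumably ask Oracle itself, as you suggest below.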
> $dbh->ora_can_store_string($string), perhaps? Bitmask return value as
> per ora_can_unicode?
>
> There are OCI functions that can answer this, e.g.
> OCINlsCharSetConvert() followed by
> OCICharSetConversionIsReplacementUsed().

Patches most welcome :)

> The Euro symbol is a good example for this question, since it's either
> not present, or in completely different places, in the most popular
> character sets.
>
> e.g. Binding "\x{20ac}" should be fine so long as your database is in
> UTF8, one of the other Unicode sets, or WE8ISO8859P15 or WE8MSWIN1252.
> But not if it's WE8ISO8859P1 (Latin-1) - that doesn't have a Euro
> symbol at all.
>
> If you try to bind its single-byte equivalent, chr(128) or chr(164), it
> depends on your client character set as well as the database character
> set.
>
> Hope I'm making some sense :-)

I think so. Thanks!

Tim.
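P.S. Andy's Euro example is easy to verify without a database, using
only core Encode (the Oracle names in the comments are just the
corresponding client character sets):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Where the Euro lives (or doesn't) in the single-byte sets above:
my $from_cp1252 = decode('cp1252',      chr 128);  # 0x80 in WE8MSWIN1252  is the Euro
my $from_latin9 = decode('iso-8859-15', chr 164);  # 0xA4 in WE8ISO8859P15 is the Euro
my $from_latin1 = decode('iso-8859-1',  chr 164);  # 0xA4 in WE8ISO8859P1  is the currency sign

printf "U+%04X U+%04X U+%04X\n", map ord, $from_cp1252, $from_latin9, $from_latin1;
# U+20AC U+20AC U+00A4
```

Which is exactly why binding chr(128) or chr(164) depends on the client
character set: the same byte names three different characters.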