UTF-8 has a variable number of bytes per character. Some encodings have 1 byte per character always. Some encodings have two bytes per character always. Assuming that the number of bytes is the same as the number of characters sometimes leads to wrong answers. When you are using perl as a client to oracle you may have characters in more than one encoding so you have to be very careful with the byte count and character count. The perl libraries which access Oracle database are built with a specific Oracle C Library implementation, and the rules for character encoding and character byte count are sometimes different from one version to another. In addition the perl rules for character counting and byte counting are different for different versions of perl. In addition the Oracle Database has its own idea of character encoding and byte count. And this also changes with version. So you are threading more than one needle at the same time - it can get (and usually does get) very complicated very quickly. The usual solution is to try to get all the layers to use the same encoding (and therefore the same number of bytes per character). UTF-8 (with Unicode Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all layers usually works, but if you don't have control over all the layers you may not be able to choose this. Another choice is to find out what character encoding the database is using and try to configure all other layers to use the same encoding. It is also helpful to know that the English letters A-Z and a-z and the digits 0-9 are often encoded the same way for many different encodings so sometimes you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping that all the layers will be "close enough". The OS also may have its own rules about character encoding so even the same version of perl on Windows and Linux may have different encoding rules.
The last time I tried to figure this out for a real life example I simply gave up trying to understand it and had to roll back the version of perl AND the DBI / DBD libraries (and therefore the version of Oracle C Client Libraries) to known working. To be absolutely 100 percent sure you get it right you need to understand the OS, perl, oracle client and oracle server rules inside and out - and frankly this is very hard. Even if you understand it all some layer may simply not be able to do what you want it to do. When the perl programming language was first created the idea was that a character was a byte and a byte was a character - things have changes since then - more than once. The same is true of the Oracle Database and client libraries - (and Windows and Linux) - the growing pains left a lot of strange artifacts behind in all. Another trick is make all buffers 4 times are big as you think you really need them and then have the database concat the data together on the server side - do in the end you are only working with one layer. Last time I checked 4 bytes per character was the worse case scenario. It might be 8 by now. There are also "lossless" ways of encoding "illegal characters", such as UTF-8 encode first then Mime Q-Encode (quoted printable) or using XML / XHTML entities like this - | or UTF-8 encode then percent-encode. This can be used work around special characters when you are dealing with mostly English with some European characters and an encoding that is either 7-bit-ascii compatible or can easily be converted to and from (losslessly) 7-bit-ascii (such as EBCIDIC with some creativity). And there is always good old hexadecimal. It all depends on your use-cases, how much control you have over your environment and how much time you can spend being clever. -----Original Message----- From: Bruce Johnson <john...@pharmacy.arizona.edu> Sent: Thursday, May 27, 2021 7:36 PM To: dbi users <dbi-users@perl.org> Subject: [EXTERNAL] Strange issue with inserting varchar2() I'm working on an app that processes (among other thing) long sections of text, and I’m running into odd multi-byte characters in some of the entries. The column in question is a varchar2(4000) so I am truncating the input to 4000 bytes. (via use “bytes; $string=substr($orig,0,4000); ” in the section where I actually truncate the string.) When I do the insert I get an ORA-12899: value too large for column "AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054, maximum: 4000) error When I check the actual length of the offending string in bytes on the perl side it is 4000 *bytes* long Original Oncology comment is 4804 chars (4824 bytes) long and when truncated in character mode to 4000 chars it’s 4018 bytes long, when truncated in byte mode, it’s 4000 chars and bytes. What’s confusing me is that the size of the string being reported by oracle in the error doesn’t match any of the values that I see in perl. I’m not sure this is a DBI issue or not. -- Bruce Johnson University of Arizona College of Pharmacy Information Technology Group Institutions do not have opinions, merely customs The information contained in this electronic mail transmission is intended only for the use of the individual or entity named in this transmission. If you are not the intended recipient of this transmission, you are hereby notified that any disclosure, copying or distribution of the contents of this transmission is strictly prohibited and that you should delete the contents of this transmission from your system immediately. Any comments or statements contained in this transmission do not necessarily reflect the views or position of Radial or its subsidiaries and/or affiliates.