Here are a few articles if you really want to dig in

https://stackoverflow.com/questions/5290182/how-many-bytes-does-one-unicode-character-take

https://docs.oracle.com/database/121/NLSPG/ch2charset.htm#NLSPG1037

https://metacpan.org/pod/Text::Iconv

https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod

https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod



-----Original Message-----
From: Fennell, Brian <fenne...@radial.com>
Sent: Thursday, May 27, 2021 8:10 PM
To: Bruce Johnson <john...@pharmacy.arizona.edu>; dbi users <dbi-users@perl.org>
Subject: [EXTERNAL] RE: Strange issue with inserting varchar2()

UTF-8 has a variable number of bytes per character.  Some encodings have 1 byte 
per character always.  Some encodings have two bytes per character always.
Assuming that the number of bytes is the same as the number of characters 
sometimes leads to wrong answers.
When you are using perl as a client to oracle you may have characters in more 
than one encoding so you have to be very careful with the byte count and 
character count.
The perl libraries which access Oracle database are built with a specific 
Oracle C Library implementation, and the rules for character encoding and 
character byte count are sometimes different from one version to another.
In addition the perl rules for character counting and byte counting are 
different for different versions of perl.
In addition the Oracle Database has its own idea of character encoding and byte 
count.  And this also changes with version.
So you are threading more than one needle at the same time - it can get (and 
usually does get) very complicated very quickly.
The usual solution is to try to get all the layers to use the same encoding 
(and therefore the same number of bytes per character).  UTF-8 (with Unicode 
Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all 
layers usually works, but if you don't have control over all the layers you may 
not be able to choose this.
Another choice is to find out what character encoding the database is using and 
try to configure all other layers to use the same encoding.
It is also helpful to know that the English letters A-Z and a-z and the digits 
0-9 are often encoded the same way for many different encodings so sometimes 
you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping 
that all the layers will be "close enough".
The OS also may have its own rules about character encoding so even the same 
version of perl on Windows and Linux may have different encoding rules.

The last time I tried to figure this out for a real life example I simply gave 
up trying to understand it and had to roll back the version of perl AND the DBI 
/ DBD libraries (and therefore the version of Oracle C Client Libraries) to 
known working.

To be absolutely 100 percent sure you get it right you need to understand the 
OS, perl, oracle client and oracle server rules inside and out - and frankly 
this is very hard.
Even if you understand it all some layer may simply not be able to do what you 
want it to do.

When the perl programming language was first created the idea was that a 
character was a byte and a byte was a character - things have changes since 
then - more than once.
The same is true of the Oracle Database and client libraries - (and Windows and 
Linux) - the growing pains left a lot of strange artifacts behind in all.

Another trick is make all buffers 4 times are big as you think you really need 
them and then have the database concat the data together on the server side - 
do in the end you are only working with one layer.
Last time I checked 4 bytes per character was the worse case scenario.  It 
might be 8 by now.

There are also "lossless" ways of encoding "illegal characters", such as UTF-8 
encode first then Mime Q-Encode (quoted printable) or using XML / XHTML 
entities like this - &#124; or UTF-8 encode then percent-encode.
This can be used work around special characters when you are dealing with 
mostly English with some European characters and an encoding that is either 
7-bit-ascii compatible or can easily be converted to and from (losslessly) 
7-bit-ascii (such as EBCIDIC with some creativity).
And there is always good old hexadecimal.

It all depends on your use-cases, how much control you have over your 
environment and how much time you can spend being clever.




-----Original Message-----
From: Bruce Johnson <john...@pharmacy.arizona.edu>
Sent: Thursday, May 27, 2021 7:36 PM
To: dbi users <dbi-users@perl.org>
Subject: [EXTERNAL] Strange issue with inserting varchar2()

I'm working on an app that processes (among other thing) long sections of text, 
and I’m running into odd multi-byte characters in some of the entries.

The column in question is a varchar2(4000) so I am truncating the input to 4000 
bytes.

(via use “bytes; $string=substr($orig,0,4000); ” in the section where I 
actually truncate the string.)

When I do the insert I get an  ORA-12899: value too large for column 
"AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054, maximum: 4000) 
error

When I check the actual length of the offending string in bytes on the perl 
side it is 4000 *bytes* long

Original Oncology comment is 4804 chars (4824 bytes) long and when truncated in 
character mode to 4000 chars it’s 4018 bytes long, when truncated in byte mode, 
it’s 4000 chars and bytes.

What’s confusing me is that the size of the string being reported by oracle in 
the error doesn’t match any of the values that I see in perl. I’m not sure this 
is a DBI issue or not.

--
Bruce Johnson
University of Arizona
College of Pharmacy
Information Technology Group

Institutions do not have opinions, merely customs




The information contained in this electronic mail transmission is intended only 
for the use of the individual or entity named in this transmission. If you are 
not the intended recipient of this transmission, you are hereby notified that 
any disclosure, copying or distribution of the contents of this transmission is 
strictly prohibited and that you should delete the contents of this 
transmission from your system immediately. Any comments or statements contained 
in this transmission do not necessarily reflect the views or position of Radial 
or its subsidiaries and/or affiliates.





The information contained in this electronic mail transmission is intended only 
for the use of the individual or entity named in this transmission. If you are 
not the intended recipient of this transmission, you are hereby notified that 
any disclosure, copying or distribution of the contents of this transmission is 
strictly prohibited and that you should delete the contents of this 
transmission from your system immediately. Any comments or statements contained 
in this transmission do not necessarily reflect the views or position of Radial 
or its subsidiaries and/or affiliates.


Reply via email to