Re: [pcre-dev] [Bug 791] New: UTF-8 support does not work on EBCDIC platforms

Philip Hazel Tue, 16 Dec 2008 13:19:08 -0800

On Tue, 16 Dec 2008, Martin Jerabek wrote:

> We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS) which uses
> EBCDIC as its character encoding. We pass only UTF-8 strings to PCRE
> using the PCRE_UTF8 flag, both as patterns and subject strings,
> because our internal string class is based on UTF-8. We configured
> PCRE with both --enable-utf8 and --enable-ebcdic, and also created the
> appropriate EBCDIC character table with dftables.


Aarrgghh! I have to admit that I *never* considered the possibility of 
anybody wanting to use UTF-8 with EBCDIC, especially as "classical" 
EBCDIC uses bytes with values greater than 127. (I was an OS/370 
programmer for many years, so I do have some EBCDIC experience. But not 
in the last 15 years or so.)

> The real problem is now that the PCRE code compares the characters
> taken from the pattern with normal C character literals such as '\\',
> '*', etc. This works fine on non-EBCDIC (ASCII) platforms because the
> 7-bit ASCII subset of UTF-8 is identical to ASCII.

Exactly - I thought that was the whole point of UTF-8.
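
To make the failure concrete, here is a minimal sketch in C. It is not PCRE code. The function names are hypothetical, and the EBCDIC byte values assume code page 037. The constants are written as numbers so that the sketch behaves the same on any host. They stand for what the literals '\\' and '*' compile to when the execution character set is EBCDIC.

```c
#include <assert.h>

/* Byte values given numerically so the result does not depend on the
   compiler's own character set. EBCDIC values assume code page 037. */
enum {
    UTF8_BACKSLASH   = 0x5C,  /* backslash in ASCII, hence in UTF-8 */
    UTF8_STAR        = 0x2A,  /* asterisk in ASCII, hence in UTF-8 */
    EBCDIC_BACKSLASH = 0xE0,  /* value of the literal '\\' on EBCDIC */
    EBCDIC_STAR      = 0x5C   /* value of the literal '*' on EBCDIC */
};

/* Hypothetical stand-ins for tests such as (c == '\\') and (c == '*')
   in a pattern compiler built on an EBCDIC platform. */
int looks_like_backslash(unsigned char c) { return c == EBCDIC_BACKSLASH; }
int looks_like_star(unsigned char c)      { return c == EBCDIC_STAR; }

/* What happens when the pattern bytes are UTF-8 rather than EBCDIC. */
void check_literal_mismatch(void)
{
    /* A UTF-8 backslash is not recognised as a backslash ... */
    assert(!looks_like_backslash(UTF8_BACKSLASH));
    /* ... it is mistaken for an asterisk instead ... */
    assert(looks_like_star(UTF8_BACKSLASH));
    /* ... and a real UTF-8 asterisk is missed. */
    assert(!looks_like_star(UTF8_STAR));
}
```

So under these assumptions the metacharacters are not merely unrecognised: the byte 0x5C silently changes meaning from backslash to asterisk.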

> - Convert the pattern characters to EBCDIC as they are retrieved with
> one of the GETCHAR macros. In this way they could be compared to C
> character literals. This would work for the meta characters but what
> about other characters from the pattern which cannot be represented in
> EBCDIC but only in Unicode?

I do not understand how that could work, because I do not see how EBCDIC 
and Unicode can be used simultaneously. Either your input is Unicode, or 
it is EBCDIC, surely? I *do* see that you could use UTF-8 to encode
codes in the range 0 to 255 and then interpret them as EBCDIC, but I
don't get it for codes greater than 255. The subset of Unicode whose
codepoints are < 256 is not the same set of characters as EBCDIC.
Consider, for example, the character A-with-grave-accent. The Unicode 
codepoint is 00C0, but there is no such character in EBCDIC. So how 
would you represent this character, given that C0 *is* already an EBCDIC 
codepoint for something else?
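
The ambiguity can be sketched in a few lines of C. The helper name utf8_encode_small is hypothetical and it only handles code points below 0x800, which is enough for this example. The remark about '{' assumes EBCDIC code page 037.

```c
#include <assert.h>
#include <stddef.h>

/* Encode a code point below 0x800 as UTF-8.
   Returns the number of bytes written, or 0 if cp is out of range. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;                        /* not handled by this sketch */
}

void check_c0_encoding(void)
{
    unsigned char buf[2];

    /* The value 0xC0 travels as the two bytes C3 80 in UTF-8. */
    assert(utf8_encode_small(0xC0, buf) == 2);
    assert(buf[0] == 0xC3 && buf[1] == 0x80);
}
```

The encoding itself is unambiguous. The question is what the decoded value 0xC0 means afterwards: read as Unicode it is A-with-grave-accent, read as an EBCDIC 037 byte it is '{'. Nothing in the byte stream says which reading is intended.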

> I am writing this bug report mainly to get input how to solve the
> problem best. I am aware that you probably do not have access to an
> EBCDIC platform and that I would have to code a fix myself.

Indeed. Our mainframe was shut down in 1994, and I thankfully stopped 
having to deal with EBCDIC. :-)

My suggestion is that you translate the other way, that is, you 
translate everything from EBCDIC into Unicode before passing it to PCRE, 
especially if you want to deal with characters that are not part of 
EBCDIC (unless I've misunderstood what you meant above). Of course, you 
would still have to fix the literals in PCRE. Hmmm. Not straightforward 
at all.
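
A rough sketch of the "translate first" direction, in C. The function names are hypothetical. The mapping assumes EBCDIC code page 037 and is deliberately partial: it covers only space, digits, unaccented letters, '*' and backslash. A real program would need a complete table or a conversion facility provided by the platform. All values are numeric, because character literals would themselves be EBCDIC on the target.

```c
#include <assert.h>
#include <stddef.h>

/* Map one EBCDIC 037 byte to a Unicode code point.
   Partial table for illustration; returns -1 for anything not covered. */
int ebcdic037_to_unicode(unsigned char e)
{
    if (e == 0x40) return 0x20;                             /* space */
    if (e >= 0xF0 && e <= 0xF9) return 0x30 + (e - 0xF0);   /* 0-9 */
    if (e >= 0x81 && e <= 0x89) return 0x61 + (e - 0x81);   /* a-i */
    if (e >= 0x91 && e <= 0x99) return 0x6A + (e - 0x91);   /* j-r */
    if (e >= 0xA2 && e <= 0xA9) return 0x73 + (e - 0xA2);   /* s-z */
    if (e >= 0xC1 && e <= 0xC9) return 0x41 + (e - 0xC1);   /* A-I */
    if (e >= 0xD1 && e <= 0xD9) return 0x4A + (e - 0xD1);   /* J-R */
    if (e >= 0xE2 && e <= 0xE9) return 0x53 + (e - 0xE2);   /* S-Z */
    if (e == 0x5C) return 0x2A;                             /* asterisk */
    if (e == 0xE0) return 0x5C;                             /* backslash */
    return -1;
}

/* Translate n EBCDIC bytes into UTF-8. Every code point this sketch can
   produce is below 0x80, so each one is a single UTF-8 byte and out
   needs room for n bytes. Returns the byte count, or -1 on an unmapped
   input byte. */
long ebcdic037_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i;
    for (i = 0; i < n; i++) {
        int cp = ebcdic037_to_unicode(in[i]);
        if (cp < 0) return -1;
        out[i] = (unsigned char)cp;
    }
    return (long)n;
}

void check_translation(void)
{
    const unsigned char in[] = { 0xC1, 0x5C, 0x82 };  /* "A*b" in EBCDIC */
    unsigned char out[3];

    assert(ebcdic037_to_utf8(in, 3, out) == 3);
    assert(out[0] == 0x41 && out[1] == 0x2A && out[2] == 0x62);
}
```

This handles only the subject and pattern data. It does nothing about the character literals inside PCRE itself, which would still be compiled as EBCDIC.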

As I see it, the first thing to make a decision on is exactly what 
characters you want to deal with and how to represent them - the 
A-with-grave-accent problem that I mentioned above.

Philip

-- 
Philip Hazel

-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
