On Tue, 16 Dec 2008, Martin Jerabek wrote:

> We use PCRE 7.8 on z/OS (formerly known as OS/390 and MVS) which uses
> EBCDIC as its character encoding. We pass only UTF-8 strings to PCRE
> using the PCRE_UTF8 flag, both as patterns and subject strings,
> because our internal string class is based on UTF-8. We configured
> PCRE with both --enable-utf8 and --enable-ebcdic, and also created the
> appropriate EBCDIC character table with dftables.

Aarrgghh! I have to admit that I *never* considered the possibility of
anybody wanting to use UTF-8 with EBCDIC, especially as "classical"
EBCDIC uses bytes with values greater than 127. (I was an OS/370
programmer for many years, so I do have some EBCDIC experience. But not
in the last 15 years or so.)

> The real problem is now that the PCRE code compares the characters
> taken from the pattern with normal C character literals such as '\\',
> '*', etc. This works fine on non-EBCDIC (ASCII) platforms because the
> 7-bit ASCII subset of UTF-8 is identical to ASCII.

Exactly - I thought that was the whole point of UTF-8.

> - Convert the pattern characters to EBCDIC as they are retrieved with
> one of the GETCHAR macros. In this way they could be compared to C
> character literals. This would work for the meta characters but what
> about other characters from the pattern which cannot be represented in
> EBCDIC but only in Unicode?

I do not understand how that could work, because I do not see how EBCDIC
and Unicode can be used simultaneously. Either your input is Unicode, or
it is EBCDIC, surely? I *do* see that you could use UTF-8 to encode
codes in the range 0 to 255 and then interpret them as EBCDIC, but I
don't get it for codes greater than 255.

The subset of Unicode whose codepoints are < 256 is not the same set of
characters as EBCDIC. Consider, for example, the character
A-with-grave-accent. The Unicode codepoint is 00C0, but there is no such
character in EBCDIC. So how would you represent this character, because
00C0 *is* an EBCDIC codepoint?

> I am writing this bug report mainly to get input how to solve the
> problem best. I am aware that you probably do not have access to an
> EBCDIC platform and that I would have to code a fix myself.

Indeed. Our mainframe was shut down in 1994, and I thankfully stopped
having to deal with EBCDIC. :-)

My suggestion is that you translate the other way, that is, you
translate everything from EBCDIC into Unicode before passing it to PCRE,
especially if you want to deal with characters that are not part of
EBCDIC (unless I've misunderstood what you meant above). Of course, you
would still have to fix the literals in PCRE.

Hmmm. Not straightforward at all. As I see it, the first thing to make a
decision on is exactly what characters you want to deal with and how to
represent them - the A-with-grave-accent problem that I mentioned above.

Philip

--
Philip Hazel

--
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
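
The literal-comparison problem and the "translate the other way" suggestion in the message above can be illustrated with a small sketch. This is not PCRE code. It assumes EBCDIC code page 037 (the message does not name a code page) and uses Python's built-in cp037 codec purely to look up byte values; the helper names ebcdic_literal, utf8_byte and ebcdic_to_utf8 are hypothetical.

```python
# Sketch only: assumes EBCDIC code page 037; helper names are hypothetical.

def ebcdic_literal(ch: str) -> int:
    """Byte value a C literal such as '*' would have if the compiler's
    execution character set were EBCDIC cp037 (assumption)."""
    return ch.encode("cp037")[0]


def utf8_byte(ch: str) -> int:
    """First byte of the character's UTF-8 encoding (equal to its ASCII
    value for characters in the 7-bit range)."""
    return ch.encode("utf-8")[0]


def ebcdic_to_utf8(data: bytes) -> bytes:
    """Translate EBCDIC (cp037) text to UTF-8 before it is handed to a
    matcher that expects UTF-8."""
    return data.decode("cp037").encode("utf-8")


if __name__ == "__main__":
    # The same literal has different byte values under the two encodings,
    # so comparing a UTF-8 pattern byte with an EBCDIC-compiled literal
    # tests for the wrong value.
    for ch in ("*", "\\"):
        print(repr(ch), hex(utf8_byte(ch)), hex(ebcdic_literal(ch)))

    # A-with-grave-accent is U+00C0, two bytes in UTF-8 ...
    print("\u00c0".encode("utf-8").hex())
    # ... while the single byte 0xC0 means something else in cp037.
    print(repr(bytes([0xC0]).decode("cp037")))

    # Translating EBCDIC input to UTF-8 up front.
    print(ebcdic_to_utf8(bytes([0xC1, 0x5C])))
```

Under this assumption the UTF-8 byte for a backslash (0x5C) is the same value an EBCDIC compiler gives the literal '*', which is why translating the input is not enough on its own and the literals in the library would still have to be fixed.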
