
http://bugs.exim.org/show_bug.cgi?id=791

--- Comment #2 from Martin Jerabek <[email protected]> 2008-12-17 09:24:02 ---
On 16.12.2008 22:19, Philip Hazel wrote:
> Aarrgghh! I have to admit that I *never* considered the possibility of 
> anybody wanting to use UTF-8 with EBCDIC, especially as "classical" 
> EBCDIC uses bytes with values greater than 127. (I was an OS/370 
> programmer for many years, so I do have some EBCDIC experience. But not 
> in the last 15 years or so.)
>   
The fact that EBCDIC uses codes > 127 does not really matter; what 
matters is that EBCDIC assigns different codes than ASCII to characters 
which are present in both character sets. A bit of background: our 
products run on many platforms (Windows, various Unixen, z/OS), so we 
decided to use UTF-8 as the internal representation of our own string 
class on all platforms, because this encoding is the only one which 
depends on neither code pages nor byte order.
>> - Convert the pattern characters to EBCDIC as they are retrieved with
>> one of the GETCHAR macros. In this way they could be compared to C
>> character literals. This would work for the meta characters but what
>> about other characters from the pattern which cannot be represented in
>> EBCDIC but only in Unicode?
>>     
>
> I do not understand how that could work, because I do not see how EBCDIC 
> and Unicode can be used simultaneously. Either your input is Unicode, or 
> it is EBCDIC, surely?
Yes. In our case it will always be UTF-8, but in the general case the 
type of input depends on the PCRE_UTF8 flag: if it is set, the input is 
UTF-8; if not, it is EBCDIC (on EBCDIC platforms).

> I *do* see that you could use UTF-8 to encode
> codes in the range 0 to 255 and then interpret them as EBCDIC, but I
> don't get it for codes greater than 255. The subset of Unicode whose
> codepoints are < 256 is not the same set of characters as EBCDIC.
> Consider, for example, the character A-with-grave-accent. The Unicode 
> codepoint is 00C0, but there is no such character in EBCDIC. So how 
> would you represent this character, because 00C0 *is* an EBCDIC 
> codepoint?
>   
You are of course correct that there are many Unicode characters which 
cannot be represented in EBCDIC, so the conversion would be lossy. (Your 
example is actually a bad one, because À and à do have EBCDIC 
codepoints, 0x64 and 0x44, but I get your point.) For the regular 
expression meta characters this would not matter, because they have 
EBCDIC counterparts; otherwise PCRE would not work on EBCDIC at all. If 
the GETCHAR macros were used only for meta characters, this could work, 
but not if they are used for all characters, because the conversion from 
UTF-8 to EBCDIC would lose information.
> My suggestion is that you translate the other way, that is, you 
> translate everything from EBCDIC into Unicode before passing it to PCRE, 
> especially if you want to deal with characters that are not part of 
> EBCDIC (unless I've misunderstood what you meant above).
We already do this by using only UTF-8 internally, even on EBCDIC 
platforms, and passing only UTF-8 strings with the PCRE_UTF8 flag. This 
is exactly what causes the problem on EBCDIC platforms: PCRE compares 
the passed UTF-8 characters to character literals, which are of course 
EBCDIC on z/OS.
> Of course, you 
> would still have to fix the literals in PCRE. Hmmm. Not straightforward 
> at all.
>   
Exactly. The idea of the lookup tables mentioned in my original post is 
still the best I can think of. I would replace all character constants 
with lookup table accesses. There would be two lookup tables: one for 
ASCII platforms and for EBCDIC platforms in UTF-8 mode (PCRE_UTF8 set), 
containing the ASCII/UTF-8 codes of the characters; the other for 
EBCDIC platforms in EBCDIC mode (PCRE_UTF8 not set), containing the 
EBCDIC codes.

As mentioned before, these tables would probably have to be 
user-configurable, like the table generated with dftables, because the 
EBCDIC codes depend on the code page in use. We always use EBCDIC code 
page 273 (Germany/Austria), in which, for example, the backslash has 
code 0xEC, but in the international EBCDIC code pages 500 and 1047 it 
has code 0xE0.

All this would only fix the problem if the C character literals are 
really the only cause of our difficulties. It would also require slight 
code modifications, because the lookup tables could not be used in 
switch/case statements, as is possible with character literals.

I will try this and report back whether it worked or not.
> As I see it, the first thing to make a decision on is exactly what 
> characters you want to deal with and how to represent them - the 
> A-with-grave-accent problem that I mentioned above.
>   
We have already decided to use only UTF-8 so that we can represent all 
characters, so for our purposes it would be sufficient to replace all 
character constants with their ASCII/UTF-8 counterparts (e.g. '\x2A' 
instead of '*'), because we never pass EBCDIC strings to PCRE. This 
modification could not be included in the official PCRE sources, 
however, because it would make PCRE unusable with EBCDIC strings.

Thanks for your quick response!
Martin Jerabek


-- 
Configure bugmail: http://bugs.exim.org/userprefs.cgi?tab=email
-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev 