http://bugs.exim.org/show_bug.cgi?id=791

--- Comment #2 from Martin Jerabek <[email protected]> 2008-12-17 09:24:02 ---

On 16.12.2008 22:19, Philip Hazel wrote:

> Aarrgghh! I have to admit that I *never* considered the possibility of
> anybody wanting to use UTF-8 with EBCDIC, especially as "classical"
> EBCDIC uses bytes with values greater than 127. (I was an OS/370
> programmer for many years, so I do have some EBCDIC experience. But not
> in the last 15 years or so.)

The fact that EBCDIC uses codes > 127 does not really matter; what matters is that EBCDIC assigns different codes than ASCII to characters which are present in both character sets.

A bit of background: our products run on many platforms (Windows, various Unixen, z/OS), so we decided to use UTF-8 as the internal representation of our own string class on all platforms, because this encoding is the only one which depends on neither code pages nor byte order.

>> - Convert the pattern characters to EBCDIC as they are retrieved with
>> one of the GETCHAR macros. In this way they could be compared to C
>> character literals. This would work for the meta characters, but what
>> about other characters from the pattern which cannot be represented in
>> EBCDIC but only in Unicode?
>
> I do not understand how that could work, because I do not see how EBCDIC
> and Unicode can be used simultaneously. Either your input is Unicode, or
> it is EBCDIC, surely?

Yes. In our case it will always be UTF-8, but in the general case the type of input depends on the PCRE_UTF8 flag: if it is set, the input is UTF-8; if not, it is EBCDIC (on EBCDIC platforms).

> I *do* see that you could use UTF-8 to encode
> codes in the range 0 to 255 and then interpret them as EBCDIC, but I
> don't get it for codes greater than 255. The subset of Unicode whose
> codepoints are < 256 is not the same set of characters as EBCDIC.
> Consider, for example, the character A-with-grave-accent.
> The Unicode
> codepoint is 00C0, but there is no such character in EBCDIC. So how
> would you represent this character, because 00C0 *is* an EBCDIC
> codepoint?

You are of course correct that there are many Unicode characters which cannot be represented in EBCDIC, so the conversion would be lossy. (Your example is actually a bad one, because À and à do have EBCDIC codepoints, 0x64 and 0x44, but I get your point.) For the regular expression meta characters this would not matter, because they have EBCDIC counterparts; otherwise PCRE would not work in EBCDIC at all. If the GETCHAR macros were only used for meta characters this could work, but of course not if they are used for all characters, because the conversion from UTF-8 to EBCDIC would lose information.

> My suggestion is that you translate the other way, that is, you
> translate everything from EBCDIC into Unicode before passing it to PCRE,
> especially if you want to deal with characters that are not part of
> EBCDIC (unless I've misunderstood what you meant above).

We already do this by using only UTF-8 internally, even on EBCDIC platforms, and passing only UTF-8 strings with the PCRE_UTF8 flag. Exactly this causes the problem on EBCDIC platforms, because PCRE compares the passed UTF-8 characters to character literals, which are of course EBCDIC on z/OS.

> Of course, you
> would still have to fix the literals in PCRE. Hmmm. Not straightforward
> at all.

Exactly. The idea of the lookup tables mentioned in my original post is still the best I can think of. I would replace all character constants with lookup table accesses. There would be two lookup tables: one for ASCII platforms and for EBCDIC platforms in UTF-8 mode (PCRE_UTF8 set), containing the ASCII/UTF-8 codes of the characters; the other for EBCDIC platforms in EBCDIC mode (PCRE_UTF8 not set), containing the EBCDIC codes.
As mentioned before, these tables would probably have to be user-configurable, like the table generated with dftables, because the EBCDIC codes depend on the code page in use. We always use EBCDIC code page 273 (Germany/Austria), in which, for example, the backslash has code 0xEC, whereas in the international EBCDIC code pages 500 and 1047 it has code 0xE0.

All this would only fix the problem if the C character literals are really the only cause of our difficulties. It would also require slight code modifications, because the lookup tables could not be used in switch/case statements the way character literals can. I will try this and report back whether it worked or not.

> As I see it, the first thing to make a decision on is exactly what
> characters you want to deal with and how to represent them - the
> A-with-grave-accent problem that I mentioned above.

We already decided to use only UTF-8 so that we can represent all characters, so for our purposes it would be sufficient to replace all character constants with their ASCII/UTF-8 counterparts (e.g. '\x2A' instead of '*'), because we never pass EBCDIC strings to PCRE. This modification could not be included in the official PCRE sources, because it would make PCRE unusable with EBCDIC strings.

Thanks for your quick response!

Martin Jerabek
