Re: Transliteration operator(tr//)on EBCDIC platform
Hi Sadahiro

The whole existing test suite passes. But a couple of the new tests fail, probably because of the multibyte representation of \x{1000}, which is a three-byte sequence in EBCDIC. These two tests are

$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{1000}\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

$c = ($a = "\xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1") =~ tr/\x{1000}\xc9-\xd1/X/;
is($c, 8);
is($a, "XXXXXXXX");

The output is:

not ok 1
# Failed at t/op/tr_new.t line 32
# got '6'
# expected '8'
not ok 2
# Failed at t/op/tr_new.t line 33
# got 'XXXðýXXX'
# expected 'XXXXXXXX'
not ok 3
# Failed at t/op/tr_new.t line 36
# got '4'
# expected '8'
not ok 4
# Failed at t/op/tr_new.t line 37
# got 'XXôöòõXX'
# expected 'XXXXXXXX'

One observation is that the Unicode character appears first in the tr//, just as in the \x{100} case where there seemed to be a problem. Seems like it doesn't handle the multibyte (2)

regards
Sastry

On 9/19/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> On Thu, 15 Sep 2005 18:31:43 +0530, Sastry [EMAIL PROTECTED] wrote
> > Hi Sadahiro
> >
> > Having incorporated the changes in doop.c and op.c, I strangely get
> > lots of failures, and here are the test results. Seems like the first
> > approach itself fails on tr//, and there will certainly be more failures
> > when we run the entire test suite, which uses these functions. In the
> > second approach, the change seems to affect only tr//. Please let me
> > know your suggestions for the changes which I can apply in
> > S_scan_const() and see if it works.
> >
> > regards
> > Sastry
>
> Here it is. All the newer code in toke.c is enclosed between #ifdef EBCDIC
> and #endif, since it is redundant on ASCII platforms. And I add some
> tests to tr.t.
>
> Regards,
> SADAHIRO Tomoyuki

!
toke.t t/op/tr.t diff -ur [EMAIL PROTECTED]/t/op/tr.t [EMAIL PROTECTED]/t/op/tr.t --- [EMAIL PROTECTED]/t/op/tr.tThu Aug 18 18:27:25 2005 +++ [EMAIL PROTECTED]/t/op/tr.t Sun Sep 18 19:59:13 2005 @@ -6,7 +6,7 @@ require './test.pl'; } -plan tests = 100; +plan tests = 120; my $Is_EBCDIC = (ord('i') == 0x89 ord('J') == 0xd1); @@ -259,7 +259,6 @@ # UTF8 range tests from Inaba Hiroto -# Not working in EBCDIC as of 12674. ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/; is($a, v192.196.172.194.197.172,'UTF range'); @@ -272,6 +271,15 @@ ($a = \x{0100}) =~ tr/\x00-\x{100}/X/; is($a, X); +($a = \x{0100}) =~ tr/\x00-\x{101}/X/; +is($a, X); + +($a = \x{0100}\x{0101}) =~ tr/\x00-\x{102}/X/; +is($a, XX); + +($a = \x{0101}\x{0102}) =~ tr/\x00-\x{103}/X/; +is($a, XX); + ($a = \x{0100}) =~ tr/\x{}-\x{00ff}/X/c; is($a, X); @@ -303,8 +311,16 @@ is($c, 8); is($a, ); +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/; +is($c, 8); +is($a, ); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/; +is($c, 8); +is($a, ); + SKIP: { -skip not EBCDIC, 4 unless $Is_EBCDIC; +skip not EBCDIC, 12 unless $Is_EBCDIC; $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j/X/; is($c, 2); @@ -313,7 +329,38 @@ $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J/X/; is($c, 2); is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}i-j/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}I-J/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j\x{1000}/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J\x{1000}/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); } + +($a = \xfc\xfd\xfe\xff) =~ tr/\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff) =~ tr/\x{1000}\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ 
tr/\x{1000}\x00-\x{100}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x00-\x{200}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\xff/X/c; +is($a, \xfc\xfd\xfe\xffX); ($a = \x{100}) =~ tr/\x00-\xff/X/c; is(ord($a), ord(X)); diff -ur [EMAIL PROTECTED]/toke.c [EMAIL PROTECTED]/toke.c --- [EMAIL PROTECTED]/toke.c Wed Sep 14 17:40:19 2005 +++ [EMAIL PROTECTED]/toke.c Mon Sep 19 12:05:41 2005 @@ -1407,6 +1407,7 @@ UV uv; #ifdef EBCDIC UV literal_endpoint = 0; +bool native_range = TRUE; /* turned to FALSE if the first endpoint is Unicode */ #endif const char *leaveit = /* set of acceptably-backslashed characters */ @@ -1429,8 +1430,14 @@ I32 i; /* current expanded character */ I32 min;/* first character in range */ I32 max;/* last character in range */ -
Re: Transliteration operator(tr//)on EBCDIC platform
On Tue, 20 Sep 2005 15:51:34 +0530, Sastry [EMAIL PROTECTED] wrote Hi Sadahiro All the existing test suite passes. But there are couple of new tests failing probably due to multibyte representation \x{1000} which is represented in three byte sequence in EBCDIC . These two tests are $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/; is($c, 8); is($a, ); $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/; is($c, 8); is($a, ); The output is: not ok 1 # Failed at t/op/tr_new.t line 32 # got '6' # expected '8' not ok 2 # Failed at t/op/tr_new.t line 33 # got 'XXXðýXXX' # expected '' not ok 3 # Failed at t/op/tr_new.t line 36 # got '4' # expected '8' not ok 4 # Failed at t/op/tr_new.t line 37 # got 'XXôöòõXX' # expected '' One observation is that since this unicode appears first in the tr// as there seemed a problem in \x{100} case, Seems like it doesn't handle the multibyte (2) regards Sastry This newer patch uses NATIVE_TO_ASCII(i) instead of NATIVE_TO_UTF(i). This is only thing which I found being wrong about the prev patch; but your result seems different from my expectation about how the output will be with NATIVE_TO_UTF(i) in the prev patch... If newer patch is still wrong, would you set DEBUG in lib/utf8_heavy.pl to be true (that is to replace the line 5 sub DEBUG () { 0 } to sub DEBUG () { 1 } and run it again? Then many verbose info will be out. Regards, SADAHIRO Tomoyuki diff -ur [EMAIL PROTECTED]/t/op/tr.t perl/t/op/tr.t --- [EMAIL PROTECTED]/t/op/tr.t Thu Aug 18 18:27:25 2005 +++ perl/t/op/tr.t Sun Sep 18 19:59:13 2005 @@ -6,7 +6,7 @@ require './test.pl'; } -plan tests = 100; +plan tests = 120; my $Is_EBCDIC = (ord('i') == 0x89 ord('J') == 0xd1); @@ -259,7 +259,6 @@ # UTF8 range tests from Inaba Hiroto -# Not working in EBCDIC as of 12674. 
($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/; is($a, v192.196.172.194.197.172,'UTF range'); @@ -272,6 +271,15 @@ ($a = \x{0100}) =~ tr/\x00-\x{100}/X/; is($a, X); +($a = \x{0100}) =~ tr/\x00-\x{101}/X/; +is($a, X); + +($a = \x{0100}\x{0101}) =~ tr/\x00-\x{102}/X/; +is($a, XX); + +($a = \x{0101}\x{0102}) =~ tr/\x00-\x{103}/X/; +is($a, XX); + ($a = \x{0100}) =~ tr/\x{}-\x{00ff}/X/c; is($a, X); @@ -303,8 +311,16 @@ is($c, 8); is($a, ); +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/; +is($c, 8); +is($a, ); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/; +is($c, 8); +is($a, ); + SKIP: { -skip not EBCDIC, 4 unless $Is_EBCDIC; +skip not EBCDIC, 12 unless $Is_EBCDIC; $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j/X/; is($c, 2); @@ -313,7 +329,38 @@ $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J/X/; is($c, 2); is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}i-j/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}I-J/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j\x{1000}/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J\x{1000}/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); } + +($a = \xfc\xfd\xfe\xff) =~ tr/\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff) =~ tr/\x{1000}\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\x{100}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x00-\x{200}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\xff/X/c; +is($a, \xfc\xfd\xfe\xffX); ($a = \x{100}) =~ tr/\x00-\xff/X/c; is(ord($a), ord(X)); diff -ur [EMAIL PROTECTED]/toke.c perl/toke.c --- [EMAIL PROTECTED]/toke.cWed Sep 14 17:40:19 2005 +++ perl/toke.c Tue Sep 20 23:09:13 2005 @@ -1407,6 +1407,7 @@ UV 
uv; #ifdef EBCDIC UV literal_endpoint = 0; +bool native_range = TRUE; /* turned to FALSE if the first endpoint is Unicode */ #endif const char *leaveit = /* set of acceptably-backslashed characters */ @@ -1429,8 +1430,14 @@ I32 i; /* current expanded character */ I32 min;/* first character in range */ I32 max;/* last character in range */ - - if (has_utf8) { +#ifdef EBCDIC + UV uvmax = 0; /* last character above byte */ +#endif + if (has_utf8 +#ifdef EBCDIC +!native_range +#endif + ) { char * const c = (char*)utf8_hop((U8*)d, -1); char *e = d++; while (e-- c) @@ -1443,12
Re: Transliteration operator(tr//)on EBCDIC platform
On Thu, 15 Sep 2005 18:31:43 +0530, Sastry [EMAIL PROTECTED] wrote Hi Sadahiro Having incorporated the changes in the doop.c and op.c I strangely get lots of failures and here are the test results. Seems like the first approach itself fails on tr// and there will certainly more failures when we run the entire test suite which uses these functions. In the second approach, the change seems to be affecting only tr// . Please let me know your suggestions for the changes which I can apply in S_scan_const() and see if it works. regards Sastry Here it is. All newer codes in toke.t are enclosed between #ifdef EBCDIC and #endif since they are redundant for ASCII platform. And I add some tests to tr.t. Regards, SADAHIRO Tomoyuki ! toke.t t/op/tr.t diff -ur [EMAIL PROTECTED]/t/op/tr.t [EMAIL PROTECTED]/t/op/tr.t --- [EMAIL PROTECTED]/t/op/tr.t Thu Aug 18 18:27:25 2005 +++ [EMAIL PROTECTED]/t/op/tr.t Sun Sep 18 19:59:13 2005 @@ -6,7 +6,7 @@ require './test.pl'; } -plan tests = 100; +plan tests = 120; my $Is_EBCDIC = (ord('i') == 0x89 ord('J') == 0xd1); @@ -259,7 +259,6 @@ # UTF8 range tests from Inaba Hiroto -# Not working in EBCDIC as of 12674. 
($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/; is($a, v192.196.172.194.197.172,'UTF range'); @@ -272,6 +271,15 @@ ($a = \x{0100}) =~ tr/\x00-\x{100}/X/; is($a, X); +($a = \x{0100}) =~ tr/\x00-\x{101}/X/; +is($a, X); + +($a = \x{0100}\x{0101}) =~ tr/\x00-\x{102}/X/; +is($a, XX); + +($a = \x{0101}\x{0102}) =~ tr/\x00-\x{103}/X/; +is($a, XX); + ($a = \x{0100}) =~ tr/\x{}-\x{00ff}/X/c; is($a, X); @@ -303,8 +311,16 @@ is($c, 8); is($a, ); +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/; +is($c, 8); +is($a, ); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/; +is($c, 8); +is($a, ); + SKIP: { -skip not EBCDIC, 4 unless $Is_EBCDIC; +skip not EBCDIC, 12 unless $Is_EBCDIC; $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j/X/; is($c, 2); @@ -313,7 +329,38 @@ $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J/X/; is($c, 2); is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}i-j/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}I-J/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); + +$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j\x{1000}/X/; +is($c, 2); +is($a, X\x8a\x8b\x8c\x8d\x8f\x90X); + +$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J\x{1000}/X/; +is($c, 2); +is($a, X\xca\xcb\xcc\xcd\xcf\xd0X); } + +($a = \xfc\xfd\xfe\xff) =~ tr/\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff) =~ tr/\x{1000}\x00-\xff/X/; +is($a, ); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\x{100}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x00-\x{200}/X/; +is($a, X); + +($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\xff/X/c; +is($a, \xfc\xfd\xfe\xffX); ($a = \x{100}) =~ tr/\x00-\xff/X/c; is(ord($a), ord(X)); diff -ur [EMAIL PROTECTED]/toke.c [EMAIL PROTECTED]/toke.c --- [EMAIL PROTECTED]/toke.cWed Sep 14 17:40:19 2005 +++ [EMAIL PROTECTED]/toke.cMon Sep 19 12:05:41 2005 
@@ -1407,6 +1407,7 @@ UV uv; #ifdef EBCDIC UV literal_endpoint = 0; +bool native_range = TRUE; /* turned to FALSE if the first endpoint is Unicode */ #endif const char *leaveit = /* set of acceptably-backslashed characters */ @@ -1429,8 +1430,14 @@ I32 i; /* current expanded character */ I32 min;/* first character in range */ I32 max;/* last character in range */ - - if (has_utf8) { +#ifdef EBCDIC + UV uvmax = 0; /* last character above byte */ +#endif + if (has_utf8 +#ifdef EBCDIC +!native_range +#endif + ) { char * const c = (char*)utf8_hop((U8*)d, -1); char *e = d++; while (e-- c) @@ -1443,12 +1450,41 @@ } i = d - SvPVX_const(sv);/* remember current offset */ +#ifdef EBCDIC + SvGROW(sv, SvLEN(sv) + (has_utf8 + ? (512 - UTF_CONTINUATION_MARK + UNISKIP(0x100)) + : 256)); + /* how many two-byte within 0..255: 128 in UTF-8, 96 in UTF-8-mod */ +#else SvGROW(sv, SvLEN(sv) + 256);/* never more than 256 chars in a range */ +#endif d = SvPVX(sv) + i; /* refresh d after realloc */ - d -= 2; /* eat the first char and the - */ +#ifdef EBCDIC + if (has_utf8) { +
Re: Transliteration operator(tr//)on EBCDIC platform
Hi Sadahiro Having incorporated the changes in the doop.c and op.c I strangely get lots of failures and here are the test results. Seems like the first approach itself fails on tr// and there will certainly more failures when we run the entire test suite which uses these functions. In the second approach, the change seems to be affecting only tr// . Please let me know your suggestions for the changes which I can apply in S_scan_const() and see if it works. regards Sastry # Failed at t/op/tr.t line 110 # got 'š\'' Wide character in print at ./test.pl line 48. # expected 'Œã\'' # Failed at t/op/tr.t line 209 Wide character in print at ./test.pl line 48. # got '¯œD–㯜D–ã' Wide character in print at ./test.pl line 48. # expected '¯œ¯Û–㯜¯Û–ã' # Failed at t/op/tr.t line 219 # got 'CDÚCDÚ' Wide character in print at ./test.pl line 48. # expected 'C¯Û–ãC¯Û–ã' # Failed at t/op/tr.t line 224 Wide character in print at ./test.pl line 48. # got 'ED–ãED–㌨Føã' Wide character in print at ./test.pl line 48. # expected 'E¯Û[E¯Û[Œ¨Føã' # Failed at t/op/tr.t line 234 Wide character in print at ./test.pl line 48. # got '¯Û¯Û¯Û¯Û¯Û¯Û' Wide character in print at ./test.pl line 48. # expected '¯ÛD¯Û¯ÛD¯Û' # Failed at t/op/tr.t line 283 Wide character in print at ./test.pl line 48. # got '¯œD–㯥E–ã' Wide character in print at ./test.pl line 48. # expected '¯œ¯œ–㯥¯Û–ã' # Failed at t/op/tr.t line 350 # got '§ÿ' Wide character in print at ./test.pl line 48. 
# expected 'ΰÎ' 1..99 ok 1 - uc ok 2 - lc ok 3 - partial uc ok 4 - EBCDIC discontinuity ok 5 - tr cancels IOK and NOK ok 6 - harmless if explicitly not updating ok 7 - harmless if implicitly not updating ok 8 - no error ok 9 - handles UTF8 ok 10 ok 11 ok 12 ok 13 ok 14 ok 15 ok 16 ok 17 - changing UTF8 chars in a UTF8 string, same length ok 18 ok 19 - more bytes ok 20 not ok 21 - Putting UT8 chars into a non-UTF8 string ok 22 ok 23 - Removing UTF8 chars from UTF8 string ok 24 ok 25 - Counting UTF8 chars in UTF8 string ok 26 - non-UTF8 chars in UTF8 string ok 27 - UTF8 chars in non-UTFs string ok 28 - tr/a-z-9// ok 29 - hyphens, leading ok 30 -trailing ok 31 -both ok 32 ok 33 ok 34 ok 35 - reversed range check ok 36 - cannot update read-only var ok 37 - explicit read-only count ok 38 - no error ok 39 - implicit read-only count ok 40 - no error ok 41 - LHS of non-updating tr ok 42 - LHS bad on updating tr ok 43 - byte2byte transliteration ok 44 ok 45 ok 46 not ok 47 - byte2wide transliteration ok 48 -wide2byte ok 49 -wide2wide not ok 50 - byte2wide wide2byte not ok 51 - all together now! 
ok 52 - transliterate and count ok 53 not ok 54 - translit w/complement ok 55 ok 56 - translit w/deletion ok 57 ok 58 - translit w/squeeze ok 59 ok 60 ok 61 ok 62 ok 63 - UTF range not ok 64 ok 65 ok 66 ok 67 ok 68 ok 69 ok 70 ok 71 ok 72 ok 73 ok 74 ok 75 ok 76 ok 77 ok 78 ok 79 ok 80 ok 81 ok 82 not ok 83 ok 84 ok 85 ok 86 ok 87 ok 88 - pp_trans needs to unshare shared hash keys ok 89 -no error ok 90 - implicit count on constant ok 91 -no error ok 92 - implicit count outside array bounds, index negative ok 93 - doesn't extend the array ok 94 - implicit count outside array bounds, index positive ok 95 - doesn't extend the array ok 96 - implicit count outside hash bounds ok 97 - doesn't extend the hash ok 98 - non-modifying tr/// on a scalar ref ok 99 - doesn't stringify its argument On 9/14/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote: On Wed, 14 Sep 2005 16:50:26 +0530, Sastry [EMAIL PROTECTED] wrote Hi Sadahiro On 9/12/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote: I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to such an ambiguity of \xc0-\xc4. In this expression the left part \x{12c}-\x{130} parsed before coerces \xc0-\xc4 into Unicode, and results in the failure. So this is still a problem on EBCDIC! Is there a way to fix this? #test case B # On ASCII platform, of course successful $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{100}\x89-\x91/X/; is($c, 8); is($a, ); This test fails on EBCDIC. In S_scan_const(), there is a statement below. /* Insert oct or hex escaped character. * There will always enough room in sv since such * escapes will be longer than any UTF-8 sequence * they can end up as. */ /* We need to map to chars to ASCII before doing the tests to cover EBCDIC */ if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) { if (!has_utf8 uv 255) { on an ASCII , the first if condition is true as uv is 137 and it falls in the variant range as uv \x7F whereas on EBCDIC the if condition is false. Can you explain why this behaviour is? 
see else for this if. This condition tests whether uv needs
Re: Transliteration operator(tr//)on EBCDIC platform
On Wed, 14 Sep 2005 16:50:26 +0530, Sastry [EMAIL PROTECTED] wrote

> Hi Sadahiro
>
> On 9/12/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> > I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to such an
> > ambiguity of \xc0-\xc4. In this expression the left part \x{12c}-\x{130},
> > parsed first, coerces \xc0-\xc4 into Unicode, and results in the failure.
>
> So this is still a problem on EBCDIC! Is there a way to fix this?
>
> > #test case B
> > # On ASCII platform, of course successful
> > $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{100}\x89-\x91/X/;
> > is($c, 8);
> > is($a, "XXXXXXXX");
>
> This test fails on EBCDIC. In S_scan_const(), there is the statement below.
>
>     /* Insert oct or hex escaped character.
>      * There will always enough room in sv since such
>      * escapes will be longer than any UTF-8 sequence
>      * they can end up as. */
>
>     /* We need to map to chars to ASCII before doing the tests
>        to cover EBCDIC */
>     if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) {
>         if (!has_utf8 && uv > 255) {
>
> On an ASCII platform the first if condition is true, as uv is 137 and
> falls in the variant range (uv > \x7F), whereas on EBCDIC the condition
> is false. Can you explain why this behaviour is?

See the else for this if. This condition tests whether uv needs multiple octets in UTF-8/UTF-EBCDIC or only a single octet. \x89 in Latin-1 corresponds to a double-octet representation in UTF-8, so the condition is true (multiple octets are needed) on an ASCII platform. \x89 in EBCDIC corresponds to a single-octet representation in UTF-EBCDIC, so the condition is false on an EBCDIC platform. Where the else branch runs, there is no difference between ASCII and UTF-8, or between single-octet EBCDIC and UTF-EBCDIC.

> Also I found that the characters are expanded during runtime in
> S_do_trans_simple_utf8()

If I understand it correctly, expansion of character ranges isn't performed in do_trans_simple_utf8(). It is performed in scan_const() for non-Unicode and in pmtrans() for Unicode.

> Do you have any suggestion where the problem is?

(1) One way (I think worse)

Perl should treat the range in the native order (not in the Unicode one) throughout parse time, compile time, and run time, using uvchr_to_utf8() instead of uvuni_to_utf8(), and utf8n_to_uvchr() instead of utf8n_to_uvuni(), in op.c#pmtrans, doop.c#do_trans_simple_utf8, etc. But swash_fetch() also needs changing (the current swash does not know EBCDIC, only Unicode), and changes to swash may corrupt lc(), uc(), regular expression \p{something}, etc.

(2) Another way (I think better)

No change to swash, pmtrans, or do_trans_*. Instead, all character ranges within 0..255 (not only for non-Unicode but also for Unicode) are expanded in scan_const(), and pmtrans() expands only uv >= 256. I think this way requires changing only toke.c#scan_const and influences only tr///. But the change will be quite big, since the current scan_const() expands only non-Unicode ranges and assumes a single-octet encoding, whereas the range 0..255 in UTF-8/UTF-EBCDIC includes double-octet characters. I'm not sure whether such a change should be enclosed in #ifdef EBCDIC and #endif.

Regards,
SADAHIRO Tomoyuki
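
As a rough illustration of approach (2), the sketch below expands a range that lies within 0..255 in native order and stores the Unicode value of each member, which is roughly what the proposal asks scan_const() to do before pmtrans() sees the list. It is not the patch itself: `sample_native_to_uni` and `expand_native_range` are hypothetical names, and the mapping covers only the EBCDIC bytes \xc0..\xc4 ('{', A, B, C, D) used as an example in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a native-to-Unicode lookup on EBCDIC.
 * Only \xc0..\xc4 are filled in; any other byte is returned unchanged
 * as a placeholder, which is NOT a real EBCDIC mapping. */
unsigned int sample_native_to_uni(unsigned char native)
{
    switch (native) {
    case 0xC0: return 0x7B; /* '{' */
    case 0xC1: return 0x41; /* 'A' */
    case 0xC2: return 0x42; /* 'B' */
    case 0xC3: return 0x43; /* 'C' */
    case 0xC4: return 0x44; /* 'D' */
    default:   return native;
    }
}

/* Expand the native range lo..hi (both within 0..255) in native order,
 * writing the Unicode value of each member to out[].
 * Returns the number of values written (at most cap). */
size_t expand_native_range(unsigned char lo, unsigned char hi,
                           unsigned int *out, size_t cap)
{
    size_t n = 0;
    unsigned int i; /* wider than a byte so the loop ends when hi == 255 */

    assert(lo <= hi); /* a reversed range would be reported as invalid */
    for (i = lo; i <= hi && n < cap; i++)
        out[n++] = sample_native_to_uni((unsigned char)i);
    return n;
}
```

Calling `expand_native_range(0xC0, 0xC4, out, 8)` yields the five values U+007B, U+0041, U+0042, U+0043, U+0044, so the list is already explicit and no later stage has to reinterpret `{-D` in Unicode order.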
Re: Transliteration operator(tr//)on EBCDIC platform
Hi Sadahiro On 9/12/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote: On Mon, 12 Sep 2005 16:12:45 +0530, Sastry [EMAIL PROTECTED] wrote Hi Sadahiro On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote: Do you think that perl-5.8.6 is not expanding the character ranges with Unicode? If so how is this test case working? ($a = \x{12d}\x{12e}\x{12f}\x{130}) =~ tr/\x{12c}-\x{130}/Y/; All the bytes are translated to Y regards -Sastry Beyond 255 (\x{ff}), I think it will be correctly expanded. \x{12c}-\x{130} is beyond 255, and thus no problem. In the range of 0..255 (inclusive), I think generally no for EBCDIC. (Why I don't say always no is that there are some cases where a character range in EBCDIC coincides with that in Unicode: for example 0-9 can be successfully expanded into 0123456789 in both encodings) I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to such an ambiguity of \xc0-\xc4. In this expression the left part \x{12c}-\x{130} parsed before coerces \xc0-\xc4 into Unicode, and results in the failure. So this is still a problem on EBCDIC! Is there a way to fix this? In contrast, I attribute the success in tr/\xc0-\xc4/\x{12c}-\x{130}/; to that \xc0-\xc4 is parsed before \x{12c}-\x{130}, and then \xc0-\xc4 is expanded into \xc0\xc1\xc2\xc3\xc4 as EBCDIC before the character list is coerced into Unicode. Well, how about the tese case B? (It has \x{100} at first and then both sides are coerced into Unicode.) #test case A # now resolved $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x89-\x91/X/; is($c, 8); is($a, ); #test case B # On ASCII platform, of course successful $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{100}\x89-\x91/X/; is($c, 8); is($a, ); This test fails on EBCDIC. In S_scan_const(), there is a statement below. /* Insert oct or hex escaped character. * There will always enough room in sv since such * escapes will be longer than any UTF-8 sequence * they can end up as. 
*/ /* We need to map to chars to ASCII before doing the tests to cover EBCDIC */ if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) { if (!has_utf8 uv 255) { on an ASCII , the first if condition is true as uv is 137 and it falls in the variant range as uv \x7F whereas on EBCDIC the if condition is false. Can you explain why this behaviour is? Also I found that the characters are expanded during runtime in S_do_trans_simple_utf8() Do you have any suggestion where the problem is? I think the current perl on EBCDIC does not translate gap characters for the test case B, which works like tr/\x{100}i-j/X/ and results in $c == 2, and $a eq X\x8a\x8b\x8c\x8d\x8f\x90X; because i's next character is j in Unicode. It expands the range but doesn't translate. And then try this: #test case C # On ASCII platform, of course successful $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x89-\x91\x{100}/X/; is($c, 8); is($a, ); This works fine I think the test case C would success even on EBCDIC, because the expansion from \x89-\x91 to \x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91 will be done before the parser finds \x{100}. Regards, SADAHIRO Tomoyuki regards Sastry --
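
The question above about the first if condition can be pictured with a small sketch that follows the explanation given elsewhere in this thread: the test is about the Unicode value of the character, not its native byte. All names here are hypothetical, the EBCDIC mapping covers only \x89 ('i'), and the single-octet limit of 0xA0 for UTF-EBCDIC is an assumption rather than something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* True if a Unicode value fits in one octet of the UTF form in use.
 * first_variant is 0x80 for UTF-8; 0xA0 is assumed for UTF-EBCDIC. */
bool is_single_octet(unsigned int uni, unsigned int first_variant)
{
    return uni < first_variant;
}

/* On an ASCII/Latin-1 platform the native value is the Unicode value. */
unsigned int native_to_uni_latin1(unsigned char native)
{
    return native;
}

/* Hypothetical EBCDIC lookup covering only \x89, which is 'i' (U+0069). */
unsigned int native_to_uni_ebcdic(unsigned char native)
{
    assert(native == 0x89); /* nothing else is defined in this sketch */
    return 0x69;
}
```

With these helpers, `is_single_octet(native_to_uni_latin1(0x89), 0x80)` is false (U+0089 needs two octets in UTF-8), while `is_single_octet(native_to_uni_ebcdic(0x89), 0xA0)` is true, which matches the different outcome of the condition on the two platforms.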
Re: Transliteration operator(tr//)on EBCDIC platform
Hi Sadahiro

On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> On Wed, 31 Aug 2005 19:53:37 +0530, Sastry [EMAIL PROTECTED] wrote
> > Hi Sadahiro
> >
> > The patch has resolved four tests that were failing previously, but one
> > more test is still failing (it was failing even before applying the
> > patch). Here is the test case
> >
> > ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
> > is($a, v192.196.172.194.197.172, 'UTF range');
> > # got 'DÐDEÐ'
> > # expected '{DÐBEÐ'
> >
> > Can you suggest some pointers towards fixing this?
> > -Sastry
>
> This EBCDIC-specific problem comes down to how to treat code values that
> include Unicode (\x{12c}-\x{130} is surely Unicode) on an EBCDIC platform.
>
> Native code values in EBCDIC (for example 'A' == 193) mostly differ from
> those in the range 0..255 of Unicode (for example 'A' == 65), which
> coincides with ASCII/Latin1. Thus the middle part of a character range is
> generally different between EBCDIC and Unicode.
>
> For example, consider the character range \xc0-\xc4. Since the mappings
> of \xc0 to '{' (an open curly) and \xc4 to D in EBCDIC are definite, the
> range \xc0-\xc4 is equivalent to {-D on an EBCDIC platform. In EBCDIC
> {-D (\xc0-\xc4) can be expanded to \xc0\xc1\xc2\xc3\xc4, but in Unicode
> {-D cannot be expanded, as the Unicode scalar values of the endpoints
> are reversed ('{' = U+007B, D = U+0044).
>
> Actually the current perl implementation is confused: at parse time
> (see toke.c#scan_const) perl treats the range in EBCDIC order and so does
> not catch it as an invalid range, though at compile time (see
> op.c#pmtrans) and at run time (see doop.c#do_trans_simple_utf8 and its
> friends) perl treats the range in Unicode order and so generates a
> strange result.

For this test, since min > max in scan_const as per their Unicode values, should we issue a warning? In that case the test case is wrong on the EBCDIC platform! Am I correct?

> In my opinion it is necessary to determine how to expand character
> ranges with Unicode (whether in native EBCDIC order or in Unicode order).
> I'm not sure that using the native encoding (ASCII/Latin1/EBCDIC) every
> time (that is the same as no Unicode within 0..255) makes people happy.

Do you think that perl-5.8.6 is not expanding the character ranges with Unicode? If so, how is this test case working?

($a = "\x{12d}\x{12e}\x{12f}\x{130}") =~ tr/\x{12c}-\x{130}/Y/;

All the characters are translated to Y

regards
-Sastry

> Regards,
> SADAHIRO Tomoyuki
Re: Transliteration operator(tr//)on EBCDIC platform
On Mon, 12 Sep 2005 16:12:45 +0530, Sastry [EMAIL PROTECTED] wrote

> Hi Sadahiro
>
> On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
>
> Do you think that perl-5.8.6 is not expanding the character ranges with
> Unicode? If so, how is this test case working?
>
> ($a = "\x{12d}\x{12e}\x{12f}\x{130}") =~ tr/\x{12c}-\x{130}/Y/;
>
> All the bytes are translated to Y
>
> regards
> -Sastry

Beyond 255 (\x{ff}), I think it will be correctly expanded. \x{12c}-\x{130} is beyond 255, and thus no problem.

In the range 0..255 (inclusive), I think generally no for EBCDIC. (The reason I don't say "always no" is that there are some cases where a character range in EBCDIC coincides with that in Unicode: for example 0-9 can be successfully expanded into 0123456789 in both encodings.)

I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to such an ambiguity of \xc0-\xc4. In this expression the left part \x{12c}-\x{130}, parsed first, coerces \xc0-\xc4 into Unicode, and results in the failure.

In contrast, I attribute the success in tr/\xc0-\xc4/\x{12c}-\x{130}/; to the fact that \xc0-\xc4 is parsed before \x{12c}-\x{130}, so \xc0-\xc4 is expanded into \xc0\xc1\xc2\xc3\xc4 as EBCDIC before the character list is coerced into Unicode.

Well, how about test case B? (It has \x{100} first, and so both sides are coerced into Unicode.)

#test case A
# now resolved
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

#test case B
# On ASCII platform, of course successful
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{100}\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

I think the current perl on EBCDIC does not translate the gap characters for test case B, which works like tr/\x{100}i-j/X/ and results in $c == 2 and $a eq "X\x8a\x8b\x8c\x8d\x8f\x90X", because the character following i is j in Unicode.

And then try this:

#test case C
# On ASCII platform, of course successful
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91\x{100}/X/;
is($c, 8);
is($a, "XXXXXXXX");

I think test case C would succeed even on EBCDIC, because the expansion of \x89-\x91 to \x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91 will be done before the parser finds \x{100}.

Regards,
SADAHIRO Tomoyuki
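
The difference between test cases A and B can be reproduced outside perl with a small sketch that counts how many bytes of the test string fall inside the range \x89-\x91 under each ordering. The function names are hypothetical, and the Unicode values for the gap bytes \x8a..\x90 are assumed from code page 037; only 'i' (\x89) and 'j' (\x91) are confirmed by the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Unicode values for native bytes 0x89..0x91.
 * The gap values are an assumption based on code page 037. */
int cp037_to_unicode(unsigned char native)
{
    static const int table[9] = {
        0x69, /* 0x89 'i' */
        0xAB, /* 0x8A */
        0xBB, /* 0x8B */
        0xF0, /* 0x8C */
        0xFD, /* 0x8D */
        0xFE, /* 0x8E */
        0xB1, /* 0x8F */
        0xB0, /* 0x90 */
        0x6A  /* 0x91 'j' */
    };
    assert(native >= 0x89 && native <= 0x91);
    return table[native - 0x89];
}

/* Count bytes of s that lie in lo..hi compared as native code values. */
size_t count_native(const unsigned char *s, size_t n,
                    unsigned char lo, unsigned char hi)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (s[i] >= lo && s[i] <= hi)
            count++;
    return count;
}

/* Count bytes of s whose Unicode value lies between the Unicode values
 * of the endpoints, i.e. the range is read in Unicode order. */
size_t count_unicode(const unsigned char *s, size_t n,
                     unsigned char lo, unsigned char hi)
{
    int ulo = cp037_to_unicode(lo);
    int uhi = cp037_to_unicode(hi);
    size_t i, count = 0;
    for (i = 0; i < n; i++) {
        int u = cp037_to_unicode(s[i]);
        if (u >= ulo && u <= uhi)
            count++;
    }
    return count;
}
```

For the eight-byte test string, the native reading matches every byte, while the Unicode reading matches only the two endpoints, which corresponds to the $c == 8 versus $c == 2 outcomes discussed above.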
Re: Transliteration operator(tr//)on EBCDIC platform
On Wed, 31 Aug 2005 19:53:37 +0530, Sastry [EMAIL PROTECTED] wrote

> Hi Sadahiro
>
> The patch has resolved four tests that were failing previously, but one
> more test is still failing (it was failing even before applying the
> patch). Here is the test case
>
> ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
> is($a, v192.196.172.194.197.172, 'UTF range');
> # got 'DÐDEÐ'
> # expected '{DÐBEÐ'
>
> Can you suggest some pointers towards fixing this?
> -Sastry

This EBCDIC-specific problem comes down to how to treat code values that include Unicode (\x{12c}-\x{130} is surely Unicode) on an EBCDIC platform.

Native code values in EBCDIC (for example 'A' == 193) mostly differ from those in the range 0..255 of Unicode (for example 'A' == 65), which coincides with ASCII/Latin1. Thus the middle part of a character range is generally different between EBCDIC and Unicode.

For example, consider the character range \xc0-\xc4. Since the mappings of \xc0 to '{' (an open curly) and \xc4 to D in EBCDIC are definite, the range \xc0-\xc4 is equivalent to {-D on an EBCDIC platform. In EBCDIC {-D (\xc0-\xc4) can be expanded to \xc0\xc1\xc2\xc3\xc4, but in Unicode {-D cannot be expanded, as the Unicode scalar values of the endpoints are reversed ('{' = U+007B, D = U+0044).

Actually the current perl implementation is confused: at parse time (see toke.c#scan_const) perl treats the range in EBCDIC order and so does not catch it as an invalid range, though at compile time (see op.c#pmtrans) and at run time (see doop.c#do_trans_simple_utf8 and its friends) perl treats the range in Unicode order and so generates a strange result.

In my opinion it is necessary to determine how to expand character ranges with Unicode (whether in native EBCDIC order or in Unicode order). I'm not sure that using the native encoding (ASCII/Latin1/EBCDIC) every time (that is the same as no Unicode within 0..255) makes people happy.

Regards,
SADAHIRO Tomoyuki
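
The reversed-endpoint point in the message above can be checked with a few lines of C. This is only a sketch with hypothetical helper names; the table holds just the three EBCDIC values quoted in the message ('{' at \xc0, 'A' at \xc1, D at \xc4).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lookup: Unicode value for a handful of EBCDIC bytes.
 * Returns -1 for bytes this illustration does not cover. */
int ebcdic_to_unicode(unsigned char native)
{
    switch (native) {
    case 0xC0: return 0x7B; /* '{' */
    case 0xC1: return 0x41; /* 'A' */
    case 0xC4: return 0x44; /* 'D' */
    default:   return -1;
    }
}

/* A range lo-hi can be expanded only if lo <= hi in the chosen order. */
bool range_valid_native(unsigned char lo, unsigned char hi)
{
    return lo <= hi;
}

bool range_valid_unicode(unsigned char lo, unsigned char hi)
{
    int ulo = ebcdic_to_unicode(lo);
    int uhi = ebcdic_to_unicode(hi);
    assert(ulo >= 0 && uhi >= 0); /* only bytes from the table above */
    return ulo <= uhi;
}
```

Here `range_valid_native(0xC0, 0xC4)` holds while `range_valid_unicode(0xC0, 0xC4)` does not, which is the mismatch between the parse-time view and the compile-time and run-time view of the same range.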
Re: Fw: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform
Hi Sadahiro

The patch has resolved the four tests that were failing previously, but one more test is still failing (it was failing even before the patch was applied). Here is the test case:

    ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
    is($a, v192.196.172.194.197.172, 'UTF range');

    # got 'DÐDEÐ'
    # expected '{DÐBEÐ'

Can you suggest some pointers towards fixing this?

-Sastry

On 8/16/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> perl5 porters,
>
> There is a response in approval from Sastry to my proposed patch. I'll forward it and now submit the proposal (on my prev mail) to p5p.
>
> Regards,
> SADAHIRO Tomoyuki
>
> Forwarded by SADAHIRO Tomoyuki [EMAIL PROTECTED]
> --- Original Message ---
> From: Sastry [EMAIL PROTECTED]
> To: SADAHIRO Tomoyuki [EMAIL PROTECTED]
> Date: Tue, 16 Aug 2005 15:27:45 +0530
> Subject: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform
>
> Hi
>
> The patch works now as expected.
>
> Thanks
> -Sastry
Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform
SADAHIRO Tomoyuki wrote:
> perl5 porters,
>
> There is a response in approval from Sastry to my proposed patch. I'll forward it and now submit the proposal (on my prev mail) to p5p.

Thanks, applied as change #25303 to bleadperl.
Fw: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform
perl5 porters,

There is a response in approval from Sastry to my proposed patch. I'll forward it and now submit the proposal (on my prev mail) to p5p.

Regards,
SADAHIRO Tomoyuki

Forwarded by SADAHIRO Tomoyuki [EMAIL PROTECTED]
--- Original Message ---
From: Sastry [EMAIL PROTECTED]
To: SADAHIRO Tomoyuki [EMAIL PROTECTED]
Date: Tue, 16 Aug 2005 15:27:45 +0530
Subject: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform

Hi

The patch works now as expected.

Thanks
-Sastry
Re: Transliteration operator(tr//)on EBCDIC platform
Hi,

This is Rajarshi expressing Sastry's viewpoints since he's on vacation.

SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> On Wed, 10 Aug 2005 14:06:56 +0530, Sastry wrote
>>>> As suggested by you, I ran the following script which resulted in substituting all the characters with X irrespective of the special case [i-j].
>>>>
>>>>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
>>>>     is($a, "XXXXXXXX");
>>>
>>> +++quote begin
>>> REGULAR EXPRESSION DIFFERENCES
>>>
>>> As of perl 5.005_03 the letter range regular expression such as [A-Z] and [a-z] have been especially coded to not pick up gap characters. For example, characters such as o WITH CIRCUMFLEX that lie between I and J would not be matched by the regular expression range /[H-K]/. This works in the other direction, too, if either of the range end points is explicitly numeric: [\x89-\x91] will match \x8e, even though \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic viewpoint.
>>
>> If I specify [\x89-\x91] it just matches the end characters (i, j) and doesn't match any of the gapped characters (including \x8e), unlike what you had mentioned. Is this correct?
>>
>> -Sastry
>
> According to the above statement in perlebcdic.pod, s/[\x89-\x91]/X/g must substitute \x8e with X. But it does not address whether tr/\x89-\x91/X/ would substitute \x8e with X or not, since tr/// does not use brackets, [ ].
>
> Though I think ranges in [ ] and ranges in tr/// should coincide, and agree that tr/\x89-\x91/X/ should substitute \x8e with X, that is just my opinion. I don't know whether it is true and correct.

Is there some way we can confirm if this is correct (and expected behaviour), since there isn't any explicit documentation for the tr operator?

> By the way, when you say "If I specify [\x89-\x91]", does it mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.

We mean tr/\x89-\x91/X/.

> You first informed us that gapped characters are not substituted with X by tr/\x89-\x91/X/. And you said s/[\x89-\x91]/X/g substituted all the characters, including gapped characters, with X, didn't you?

Yes.

> If so, I assume your [\x89-\x91], which doesn't match any of the gapped characters, to be tr/\x89-\x91/X/.

That's correct. We mean tr/\x89-\x91/X/.

> The following is a part of the current core tests from op/pat.t. I believe they should pass.

Yes, all the following tests pass. I think the following tests are in the context of the s/[]/X/ operator and hence pass.

Thanks,
Rajarshi.
[PATCH] Re: Transliteration operator(tr//)on EBCDIC platform
On Wed, 10 Aug 2005 23:56:31 -0700 (PDT), rajarshi das [EMAIL PROTECTED] wrote
> Hi,
>
> This is Rajarshi expressing Sastry's viewpoints since he's on vacation.
>
> SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
>> According to the above statement in perlebcdic.pod, s/[\x89-\x91]/X/g must substitute \x8e with X. But it does not address whether tr/\x89-\x91/X/ would substitute \x8e with X or not, since tr/// does not use brackets, [ ].
>>
>> Though I think ranges in [ ] and ranges in tr/// should coincide, and agree that tr/\x89-\x91/X/ should substitute \x8e with X, that is just my opinion. I don't know whether it is true and correct.
>
> Is there some way we can confirm if this is correct (and expected behaviour), since there isn't any explicit documentation for the tr operator?

Since t/op/tr.t already has a test case (cf. Change 9038), which Sastry previously pointed out is failing on the EBCDIC platform, I assume that at least the then pumpking thought it to be correct.

>> By the way, when you say "If I specify [\x89-\x91]", does it mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.
>
> We mean tr/\x89-\x91/X/.
>
>> You first informed us that gapped characters are not substituted with X by tr/\x89-\x91/X/. And you said s/[\x89-\x91]/X/g substituted all the characters, including gapped characters, with X, didn't you?
>
> Yes.
>
>> If so, I assume your [\x89-\x91], which doesn't match any of the gapped characters, to be tr/\x89-\x91/X/.
>
> That's correct. We mean tr/\x89-\x91/X/.
>
>> The following is a part of the current core tests from op/pat.t. I believe they should pass.
>
> Yes, all the following tests pass. I think the following tests are in the context of the s/[]/X/ operator and hence pass.
>
> Thanks,
> Rajarshi.

OK. To me, it is confirmed that s/[]/X/ is fine and tr/// has a problem.

Since I don't have any EBCDIC machine, I can't ensure that the following patch really makes sense.

Regards,
SADAHIRO Tomoyuki

! t/op/tr.t, toke.c

diff -ur perl~/t/op/tr.t perl/t/op/tr.t
--- perl~/t/op/tr.t Mon Aug 01 17:17:24 2005
+++ perl/t/op/tr.t Thu Aug 11 23:41:22 2005
@@ -295,18 +295,15 @@
 # (i-j, r-s, I-J, R-S), [\x89-\x91] [\xc9-\xd1] has to match them,
 # from Karsten Sperling.
 
-# Not working in EBCDIC as of 12674.
 $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
 is($c, 8);
 is($a, "XXXXXXXX");
-
-# Not working in EBCDIC as of 12674.
+
 $c = ($a = "\xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1") =~ tr/\xc9-\xd1/X/;
 is($c, 8);
 is($a, "XXXXXXXX");
 
-
-SKIP: {
+SKIP: {
     skip "not EBCDIC", 4 unless $Is_EBCDIC;
 
     $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/i-j/X/;
diff -ur perl~/toke.c perl/toke.c
--- perl~/toke.c Mon Jul 18 04:31:02 2005
+++ perl/toke.c Thu Aug 11 22:55:18 2005
@@ -1368,6 +1368,9 @@
     I32 has_utf8 = FALSE;   /* Output constant is UTF8 */
     I32 this_utf8 = UTF;    /* The source string is assumed to be UTF8 */
     UV uv;
+#ifdef EBCDIC
+    UV literal_endpoint = 0;
+#endif
 
     const char *leaveit =   /* set of acceptably-backslashed characters */
         PL_lex_inpat
@@ -1417,8 +1420,9 @@
                 }
 
 #ifdef EBCDIC
-                if ((isLOWER(min) && isLOWER(max)) ||
-                    (isUPPER(min) && isUPPER(max))) {
+                if (literal_endpoint == 2 &&
+                    ((isLOWER(min) && isLOWER(max)) ||
+                     (isUPPER(min) && isUPPER(max)))) {
                     if (isLOWER(min)) {
                         for (i = min; i <= max; i++)
                             if (isLOWER(i))
@@ -1437,6 +1441,9 @@
                 /* mark the range as done, and continue */
                 dorange = FALSE;
                 didrange = TRUE;
+#ifdef EBCDIC
+                literal_endpoint = 0;
+#endif
                 continue;
             }
 
@@ -1455,6 +1462,9 @@
             }
             else {
                 didrange = FALSE;
+#ifdef EBCDIC
+                literal_endpoint = 0;
+#endif
             }
         }
 
@@ -1788,6 +1798,10 @@
             s++;
             continue;
         } /* end if (backslash) */
+#ifdef EBCDIC
+        else
+            literal_endpoint++;
+#endif
 
     default_action:
         /* If we started with encoded form, or already know we want it
###END OF PATCH
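The intended effect of the toke.c hunks above can be sketched in standalone C. This is only an illustration under stated assumptions, not the code in S_scan_const(): the names `ebcdic_is_lower`, `ebcdic_is_upper` and `expand_tr_range` are hypothetical, the case tables assume the usual EBCDIC letter layout (a-i at 0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9, upper case 0x40 higher), and `literal_endpoint` is passed in as an argument instead of being counted by the tokenizer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for isLOWER()/isUPPER() on an EBCDIC build. */
static int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

static int ebcdic_is_upper(unsigned c)
{
    return (c >= 0xC1 && c <= 0xC9) ||
           (c >= 0xD1 && c <= 0xD9) ||
           (c >= 0xE2 && c <= 0xE9);
}

/* Expand min..max (both <= 0xFF, min <= max) into out, which must hold
 * at least 256 bytes.  Gap bytes are skipped only when both endpoints
 * were written as literal letters (literal_endpoint == 2); a range
 * written with \x escapes keeps every byte.  Returns the byte count. */
static size_t expand_tr_range(unsigned min, unsigned max,
                              int literal_endpoint, unsigned char *out)
{
    int same_case = (ebcdic_is_lower(min) && ebcdic_is_lower(max)) ||
                    (ebcdic_is_upper(min) && ebcdic_is_upper(max));
    int letters_only = (literal_endpoint == 2) && same_case;
    int lower = ebcdic_is_lower(min);
    size_t n = 0;
    unsigned i;

    for (i = min; i <= max; i++) {
        if (letters_only &&
            !(lower ? ebcdic_is_lower(i) : ebcdic_is_upper(i)))
            continue;               /* gap byte between two letters */
        out[n++] = (unsigned char)i;
    }
    return n;
}
```

Under these assumptions, `expand_tr_range(0x89, 0x91, 2, buf)` models tr/i-j/X/ and keeps only the two letters, while `expand_tr_range(0x89, 0x91, 0, buf)` models tr/\x89-\x91/X/ and keeps all nine bytes, including \x8e.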
Re: Transliteration operator(tr//)on EBCDIC platform
On 8/9/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> Hello,
>
> On Tue, 9 Aug 2005 15:09:42 +0530, Sastry [EMAIL PROTECTED] wrote
>> Hi
>>
>> As suggested by you, I ran the following script which resulted in substituting all the characters with X irrespective of the special case [i-j].
>>
>>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
>>     is($a, "XXXXXXXX");
>
> Right, that behavior of ranges in character classes [ ] is to be expected from literal_endpoint, which was introduced by Change 16556.
> cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
>
>> I have also observed that whenever there are any gapped characters, e.g. [r-s] as in the following script, it just translates 'r' and 's' to X alone!
>>
>>     ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
>>     is($a, "XX");
>>
>> a) Why is it mentioned that when [i-j] is included [\x89-\x91] should not be included?
>> b) Do you think there is a bug in the tr// implementation as a consequence of the above?
>>
>> -Sastry
>
> Answer for a) is mentioned in perlebcdic.pod. The last sentence (This works in...) seems to have been added there together with Change 16556 mentioned above.
>
> +++quote begin
> REGULAR EXPRESSION DIFFERENCES
>
> As of perl 5.005_03 the letter range regular expression such as [A-Z] and [a-z] have been especially coded to not pick up gap characters. For example, characters such as o WITH CIRCUMFLEX that lie between I and J would not be matched by the regular expression range /[H-K]/. This works in the other direction, too, if either of the range end points is explicitly numeric: [\x89-\x91] will match \x8e, even though \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic viewpoint.

If I specify [\x89-\x91] it just matches the end characters (i, j) and doesn't match any of the gapped characters (including \x8e), unlike what you had mentioned. Is this correct?

-Sastry
Re: Transliteration operator(tr//)on EBCDIC platform
On Wed, 10 Aug 2005 14:06:56 +0530, Sastry [EMAIL PROTECTED] wrote
>>> As suggested by you, I ran the following script which resulted in substituting all the characters with X irrespective of the special case [i-j].
>>>
>>>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
>>>     is($a, "XXXXXXXX");
>>
>> +++quote begin
>> REGULAR EXPRESSION DIFFERENCES
>>
>> As of perl 5.005_03 the letter range regular expression such as [A-Z] and [a-z] have been especially coded to not pick up gap characters. For example, characters such as o WITH CIRCUMFLEX that lie between I and J would not be matched by the regular expression range /[H-K]/. This works in the other direction, too, if either of the range end points is explicitly numeric: [\x89-\x91] will match \x8e, even though \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic viewpoint.
>
> If I specify [\x89-\x91] it just matches the end characters (i, j) and doesn't match any of the gapped characters (including \x8e), unlike what you had mentioned. Is this correct?
>
> -Sastry

According to the above statement in perlebcdic.pod, s/[\x89-\x91]/X/g must substitute \x8e with X. But it does not address whether tr/\x89-\x91/X/ would substitute \x8e with X or not, since tr/// does not use brackets, [ ].

Though I think ranges in [ ] and ranges in tr/// should coincide, and agree that tr/\x89-\x91/X/ should substitute \x8e with X, that is just my opinion. I don't know whether it is true and correct.

By the way, when you say "If I specify [\x89-\x91]", does it mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.

You first informed us that gapped characters are not substituted with X by tr/\x89-\x91/X/. And you said s/[\x89-\x91]/X/g substituted all the characters, including gapped characters, with X, didn't you?

If so, I assume your [\x89-\x91], which doesn't match any of the gapped characters, to be tr/\x89-\x91/X/.

The following is a part of the current core tests from op/pat.t. I believe they should pass.

Regards,
SADAHIRO Tomoyuki

+++begin
# The 242 and 243 go with the 244 and 245.
# The trick is that in EBCDIC the explicit numeric range should match
# (as also in non-EBCDIC) but the explicit alphabetic range should not match.

if ("\x8e" =~ /[\x89-\x91]/) {
    print "ok 242\n";
} else {
    print "not ok 242\n";
}

if ("\xce" =~ /[\xc9-\xd1]/) {
    print "ok 243\n";
} else {
    print "not ok 243\n";
}

# In most places these tests would succeed since \x8e does not
# in most character sets match 'i' or 'j' nor would \xce match
# 'I' or 'J', but strictly speaking these tests are here for
# the good of EBCDIC, so let's test these only there.
if (ord('i') == 0x89 && ord('J') == 0xd1) { # EBCDIC
    if ("\x8e" !~ /[i-j]/) {
        print "ok 244\n";
    } else {
        print "not ok 244\n";
    }
    if ("\xce" !~ /[I-J]/) {
        print "ok 245\n";
    } else {
        print "not ok 245\n";
    }
} else {
    for (244..245) {
        print "ok $_ # Skip: only in EBCDIC\n";
    }
}
---end
Re: Transliteration operator(tr//)on EBCDIC platform
Hello,

On Tue, 9 Aug 2005 15:09:42 +0530, Sastry [EMAIL PROTECTED] wrote
> Hi
>
> As suggested by you, I ran the following script which resulted in substituting all the characters with X irrespective of the special case [i-j].
>
>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
>     is($a, "XXXXXXXX");

Right, that behavior of ranges in character classes [ ] is to be expected from literal_endpoint, which was introduced by Change 16556.
cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556

> I have also observed that whenever there are any gapped characters, e.g. [r-s] as in the following script, it just translates 'r' and 's' to X alone!
>
>     ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
>     is($a, "XX");
>
> a) Why is it mentioned that when [i-j] is included [\x89-\x91] should not be included?
> b) Do you think there is a bug in the tr// implementation as a consequence of the above?
>
> -Sastry

Answer for a) is mentioned in perlebcdic.pod. The last sentence (This works in...) seems to have been added there together with Change 16556 mentioned above.

+++quote begin
REGULAR EXPRESSION DIFFERENCES

As of perl 5.005_03 the letter range regular expression such as [A-Z] and [a-z] have been especially coded to not pick up gap characters. For example, characters such as o WITH CIRCUMFLEX that lie between I and J would not be matched by the regular expression range /[H-K]/. This works in the other direction, too, if either of the range end points is explicitly numeric: [\x89-\x91] will match \x8e, even though \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic viewpoint.
---quote end

I'll give some additional explanations from the viewpoint of portability:

A letter range [h-k] always means [hijk], even on EBCDIC platforms, and not [hi\x8A-\x90jk], because the string "h" is always the small letter 'h', whether its code value is 0x68 or 0x88.

Likewise, a numeric range [\x89-\x91] should always mean [\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91], even on EBCDIC platforms, and not [\x89\x91], because the string "\x89" always stands for the code value 0x89, whether it encodes a certain C1 control character or the letter 'i'.

b): In my opinion the above change in [ ] for regular expressions is an improvement, and a similar change in tr/// is also advisable. The reason why I hesitate to use the word "bug" is the following statement on tr/// in perlop.pod, esp. the last sentence:

+++quote begin
Note also that the whole range idea is rather unportable between character sets--and even within character sets they may cause results you probably didn't expect. A sound principle is to use only ranges that begin from and end at either alphabets of equal case (a-e, A-E), or digits (0-4). Anything else is unsafe. If in doubt, spell out the character sets in full.
---quote end

There, numeric ranges such as \x89-\x91 are not declared to be safe, but to be unsafe.

Regards,
SADAHIRO Tomoyuki
Re: Transliteration operator(tr//)on EBCDIC platform
On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
> Hi
>
> I am trying to run this script on an EBCDIC platform using perl-5.8.6
>
>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
>     is($a, "XXXXXXXX");
>
> The result I get is 'X«»ðý±°X'
>
> a) Is this happening since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped characters in EBCDIC ?

I think so. In that \x89 is 'i' and \x91 is 'j'.

> b) Should all the bytes in $a change to X?

I don't know. It seems to be some special case code in regexec.c:

    #ifdef EBCDIC
        /* In EBCDIC [\x89-\x91] should include
         * the \x8e but [i-j] should not. */
        if (literal_endpoint == 2 &&
            ((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
             (isUPPER(prevvalue) && isUPPER(ceilvalue))))
        {
            if (isLOWER(prevvalue)) {
                for (i = prevvalue; i <= ceilvalue; i++)
                    if (isLOWER(i))
                        ANYOF_BITMAP_SET(ret, i);
            } else {
                for (i = prevvalue; i <= ceilvalue; i++)
                    if (isUPPER(i))
                        ANYOF_BITMAP_SET(ret, i);
            }
        }
        else
    #endif

which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.

I don't know where ranges in tr/// are parsed, but given that I grepped for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91// is treated as tr/i-j// and in turn i-j is treated as letters and always special cased.

I don't know if tr/i-j// and tr/\x89-\x91// should behave differently (i.e. whether we currently have a bug).

Nicholas Clark
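The character-class branch quoted above can be illustrated with a small standalone model. This is a sketch, not Perl's regex compiler: `class_add_range`, `class_has`, `ebcdic_is_lower` and `ebcdic_is_upper` are hypothetical names, a plain 256-bit array stands in for the ANYOF bitmap, and the case tables assume the usual EBCDIC letter layout (a-i at 0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9, upper case 0x40 higher).

```c
#define CLASS_BYTES 32   /* 256 bits, one per byte value */

/* Hypothetical stand-ins for isLOWER()/isUPPER() on an EBCDIC build. */
static int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

static int ebcdic_is_upper(unsigned c)
{
    return (c >= 0xC1 && c <= 0xC9) ||
           (c >= 0xD1 && c <= 0xD9) ||
           (c >= 0xE2 && c <= 0xE9);
}

/* Add lo..hi (both <= 0xFF, lo <= hi) to the class bitmap.  When both
 * endpoints were literal letters of the same case (literal_endpoint == 2),
 * only letters of that case are set; otherwise every byte is set. */
static void class_add_range(unsigned char bitmap[CLASS_BYTES],
                            unsigned lo, unsigned hi, int literal_endpoint)
{
    int same_case = (ebcdic_is_lower(lo) && ebcdic_is_lower(hi)) ||
                    (ebcdic_is_upper(lo) && ebcdic_is_upper(hi));
    int letters_only = (literal_endpoint == 2) && same_case;
    int lower = ebcdic_is_lower(lo);
    unsigned i;

    for (i = lo; i <= hi; i++) {
        if (letters_only &&
            !(lower ? ebcdic_is_lower(i) : ebcdic_is_upper(i)))
            continue;               /* gap byte: leave it out of the class */
        bitmap[i >> 3] |= (unsigned char)(1u << (i & 7));
    }
}

/* Test whether byte value c (<= 0xFF) is a member of the class. */
static int class_has(const unsigned char bitmap[CLASS_BYTES], unsigned c)
{
    return (bitmap[c >> 3] >> (c & 7)) & 1;
}
```

In this model a class built with `class_add_range(bm, 0x89, 0x91, 2)` corresponds to [i-j] and excludes \x8e, while the same call with `literal_endpoint` 0 corresponds to [\x89-\x91] and includes it.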
Re: Transliteration operator(tr//)on EBCDIC platform
On Mon, 8 Aug 2005 15:36:40 +0100, Nicholas Clark [EMAIL PROTECTED] wrote
> On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
>> Hi
>>
>> I am trying to run this script on an EBCDIC platform using perl-5.8.6
>>
>>     ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
>>     is($a, "XXXXXXXX");
>>
>> The result I get is 'X«»ðý±°X'
>>
>> a) Is this happening since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped characters in EBCDIC ?
>
> I think so. In that \x89 is 'i' and \x91 is 'j'.
>
>> b) Should all the bytes in $a change to X?
>
> I don't know. It seems to be some special case code in regexec.c:
>
>     #ifdef EBCDIC
>         /* In EBCDIC [\x89-\x91] should include
>          * the \x8e but [i-j] should not. */
>         if (literal_endpoint == 2 &&
>             ((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
>              (isUPPER(prevvalue) && isUPPER(ceilvalue))))
>         {
>             if (isLOWER(prevvalue)) {
>                 for (i = prevvalue; i <= ceilvalue; i++)
>                     if (isLOWER(i))
>                         ANYOF_BITMAP_SET(ret, i);
>             } else {
>                 for (i = prevvalue; i <= ceilvalue; i++)
>                     if (isUPPER(i))
>                         ANYOF_BITMAP_SET(ret, i);
>             }
>         }
>         else
>     #endif
>
> which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.
>
> I don't know where ranges in tr/// are parsed, but given that I grepped for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91// is treated as tr/i-j// and in turn i-j is treated as letters and always special cased

S_scan_const() in toke.c seems to expand ranges in tr///, while S_regclass() in regcomp.c (which I assume is what you mean) copes with those in [].

From toke.c, line 1419:

    #ifdef EBCDIC
        if ((isLOWER(min) && isLOWER(max)) ||
            (isUPPER(min) && isUPPER(max))) {
            if (isLOWER(min)) {
                for (i = min; i <= max; i++)
                    if (isLOWER(i))
                        *d++ = NATIVE_TO_NEED(has_utf8,i);
            } else {
                for (i = min; i <= max; i++)
                    if (isUPPER(i))
                        *d++ = NATIVE_TO_NEED(has_utf8,i);
            }
        }
        else
    #endif

The former has nothing like the literal_endpoint found in the latter; thus tr/// does not seem to tell literals from metacharacters in ranges, and tr/\x89-\x91/X/ will not replace \x8e on EBCDIC.

Hmm, it may be an inconsistency in the case of EBCDIC.

Sastry, would you please run the following snippet on your EBCDIC machine?

    ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
    is($a, "XXXXXXXX");

Does that work similarly to yours?

    ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
    is($a, "XXXXXXXX");

Regards,
SADAHIRO Tomoyuki
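To make the reported result concrete, the toke.c branch quoted in this message can be modelled in standalone C. This is a sketch under stated assumptions, not the real S_scan_const(): `expand_tr_range_prepatch`, `ebcdic_is_lower` and `ebcdic_is_upper` are hypothetical names, the case tables assume the usual EBCDIC letter layout (a-i at 0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9, upper case 0x40 higher), and UTF-8 handling (NATIVE_TO_NEED) is left out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for isLOWER()/isUPPER() on an EBCDIC build. */
static int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

static int ebcdic_is_upper(unsigned c)
{
    return (c >= 0xC1 && c <= 0xC9) ||
           (c >= 0xD1 && c <= 0xD9) ||
           (c >= 0xE2 && c <= 0xE9);
}

/* Expand min..max (both <= 0xFF, min <= max) into out, which must hold
 * at least 256 bytes.  As in the quoted branch, there is no notion of
 * how the endpoints were written: whenever both are letters of the same
 * case, every non-letter byte in between is dropped.  Returns the count. */
static size_t expand_tr_range_prepatch(unsigned min, unsigned max,
                                       unsigned char *out)
{
    int lower = ebcdic_is_lower(min) && ebcdic_is_lower(max);
    int upper = ebcdic_is_upper(min) && ebcdic_is_upper(max);
    size_t n = 0;
    unsigned i;

    for (i = min; i <= max; i++) {
        if (lower && !ebcdic_is_lower(i))
            continue;
        if (upper && !ebcdic_is_upper(i))
            continue;
        out[n++] = (unsigned char)i;
    }
    return n;
}
```

In this model the range 0x89..0x91 expands to just 0x89 and 0x91, which is consistent with Sastry's output 'X«»ðý±°X', where only the first and last bytes became X.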
Transliteration operator(tr//)on EBCDIC platform
Hi

I am trying to run this script on an EBCDIC platform using perl-5.8.6

    ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
    is($a, "XXXXXXXX");

The result I get is 'X«»ðý±°X'

a) Is this happening since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped characters in EBCDIC ?
or
b) Should all the bytes in $a change to X?

Thanks in advance
Sastry