Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-20 Thread Sastry
Hi Sadahiro

The entire existing test suite passes. But a couple of the new tests
fail, probably because of the multibyte representation of \x{1000}, which is
represented as a three-byte sequence in EBCDIC. These two tests are:

$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{1000}\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

$c = ($a = "\xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1") =~ tr/\x{1000}\xc9-\xd1/X/;
is($c, 8);
is($a, "XXXXXXXX");

The output is:

not ok 1
# Failed at t/op/tr_new.t line 32
#  got '6'
# expected '8'
not ok 2
# Failed at t/op/tr_new.t line 33
#  got 'XXXðýXXX'
# expected 'XXXXXXXX'
not ok 3
# Failed at t/op/tr_new.t line 36
#  got '4'
# expected '8'
not ok 4
# Failed at t/op/tr_new.t line 37
#  got 'XXôöòõXX'
# expected 'XXXXXXXX'

One observation: the Unicode character appears first in the tr//,
as in the \x{100} case that showed a problem earlier. It seems that
the multibyte character is not handled in (2).

regards
Sastry


Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-20 Thread SADAHIRO Tomoyuki

On Tue, 20 Sep 2005 15:51:34 +0530, Sastry [EMAIL PROTECTED] wrote

 Hi Sadahiro
 
 All the existing test suite passes. But there are couple of new tests
 failing probably due to multibyte representation \x{1000} which is
 represented in three byte sequence in EBCDIC . These two tests are
 
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/;
 is($c, 8);
 is($a, );
 
 $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/;
 is($c, 8);
 is($a, );
 
 The output is:
 
 not ok 1
 # Failed at t/op/tr_new.t line 32
 #  got '6'
 # expected '8'
 not ok 2
 # Failed at t/op/tr_new.t line 33
 #  got 'XXXðýXXX'
 # expected ''
 not ok 3
 # Failed at t/op/tr_new.t line 36
 #  got '4'
 # expected '8'
 not ok 4
 # Failed at t/op/tr_new.t line 37
 #  got 'XXôöòõXX'
 # expected ''
 
 One observation is that since this unicode appears first in the tr//
 as there seemed a problem in \x{100} case, Seems like it doesn't
 handle the  multibyte (2)
 
 regards
 Sastry

This newer patch uses NATIVE_TO_ASCII(i) instead of
NATIVE_TO_UTF(i). This is the only thing I found to be wrong
in the previous patch; however, your result seems different from what
I expected the output to be with NATIVE_TO_UTF(i)
in the previous patch...

If the newer patch is still wrong, would you set DEBUG
in lib/utf8_heavy.pl to true and run it again? That is, replace line 5

sub DEBUG () { 0 }

with

sub DEBUG () { 1 }

A lot of verbose information will then be printed.

Regards,
SADAHIRO Tomoyuki

diff -ur [EMAIL PROTECTED]/t/op/tr.t perl/t/op/tr.t
--- [EMAIL PROTECTED]/t/op/tr.t Thu Aug 18 18:27:25 2005
+++ perl/t/op/tr.t  Sun Sep 18 19:59:13 2005
@@ -6,7 +6,7 @@
 require './test.pl';
 }
 
-plan tests = 100;
+plan tests = 120;
 
 my $Is_EBCDIC = (ord('i') == 0x89  ord('J') == 0xd1);
 
@@ -259,7 +259,6 @@
 
 # UTF8 range tests from Inaba Hiroto
 
-# Not working in EBCDIC as of 12674.
 ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
 is($a, v192.196.172.194.197.172,'UTF range');
 
@@ -272,6 +271,15 @@
 ($a = \x{0100}) =~ tr/\x00-\x{100}/X/;
 is($a, X);
 
+($a = \x{0100}) =~ tr/\x00-\x{101}/X/;
+is($a, X);
+
+($a = \x{0100}\x{0101}) =~ tr/\x00-\x{102}/X/;
+is($a, XX);
+
+($a = \x{0101}\x{0102}) =~ tr/\x00-\x{103}/X/;
+is($a, XX);
+
 ($a = \x{0100}) =~ tr/\x{}-\x{00ff}/X/c;
 is($a, X);
 
@@ -303,8 +311,16 @@
 is($c, 8);
 is($a, );
 
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/;
+is($c, 8);
+is($a, );
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/;
+is($c, 8);
+is($a, );
+
 SKIP: {
-skip not EBCDIC, 4 unless $Is_EBCDIC;
+skip not EBCDIC, 12 unless $Is_EBCDIC;
 
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j/X/;
 is($c, 2);
@@ -313,7 +329,38 @@
 $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J/X/;
 is($c, 2);
 is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
+
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}i-j/X/;
+is($c, 2);
+is($a, X\x8a\x8b\x8c\x8d\x8f\x90X);
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}I-J/X/;
+is($c, 2);
+is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
+
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j\x{1000}/X/;
+is($c, 2);
+is($a, X\x8a\x8b\x8c\x8d\x8f\x90X);
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J\x{1000}/X/;
+is($c, 2);
+is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
 }
+
+($a = \xfc\xfd\xfe\xff) =~ tr/\x00-\xff/X/;
+is($a, );
+
+($a = \xfc\xfd\xfe\xff) =~ tr/\x{1000}\x00-\xff/X/;
+is($a, );
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\x{100}/X/;
+is($a, X);
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x00-\x{200}/X/;
+is($a, X);
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\xff/X/c;
+is($a, \xfc\xfd\xfe\xffX);
 
 ($a = \x{100}) =~ tr/\x00-\xff/X/c;
 is(ord($a), ord(X));
diff -ur [EMAIL PROTECTED]/toke.c perl/toke.c
--- [EMAIL PROTECTED]/toke.cWed Sep 14 17:40:19 2005
+++ perl/toke.c Tue Sep 20 23:09:13 2005
@@ -1407,6 +1407,7 @@
 UV uv;
 #ifdef EBCDIC
 UV literal_endpoint = 0;
+bool native_range = TRUE; /* turned to FALSE if the first endpoint is 
Unicode */
 #endif
 
 const char *leaveit =  /* set of acceptably-backslashed characters */
@@ -1429,8 +1430,14 @@
I32 i;  /* current expanded character */
I32 min;/* first character in range */
I32 max;/* last character in range */
-
-   if (has_utf8) {
+#ifdef EBCDIC
+   UV  uvmax = 0;  /* last character above byte */
+#endif
+   if (has_utf8
+#ifdef EBCDIC
+!native_range
+#endif
+   ) {
char * const c = (char*)utf8_hop((U8*)d, -1);
char *e = d++;
while (e--  c)
@@ -1443,12 

Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-19 Thread SADAHIRO Tomoyuki

On Thu, 15 Sep 2005 18:31:43 +0530, Sastry [EMAIL PROTECTED] wrote

 Hi Sadahiro
 
 Having incorporated the changes in the doop.c and op.c
 I strangely get lots of failures and here are the test results. Seems
 like the first approach itself fails on tr// and there will certainly
 more failures when we run the entire test suite which uses these
 functions.
In the second approach, the change seems to be
 affecting only tr// . Please let me know your suggestions for the
 changes which I can apply in S_scan_const() and see if it works.
 
 regards
 Sastry

Here it is.
All the new code in toke.c is enclosed between #ifdef EBCDIC
and #endif, since it is redundant on ASCII platforms.
I have also added some tests to tr.t.

Regards,
SADAHIRO Tomoyuki

! toke.c t/op/tr.t

diff -ur [EMAIL PROTECTED]/t/op/tr.t [EMAIL PROTECTED]/t/op/tr.t
--- [EMAIL PROTECTED]/t/op/tr.t Thu Aug 18 18:27:25 2005
+++ [EMAIL PROTECTED]/t/op/tr.t Sun Sep 18 19:59:13 2005
@@ -6,7 +6,7 @@
 require './test.pl';
 }
 
-plan tests = 100;
+plan tests = 120;
 
 my $Is_EBCDIC = (ord('i') == 0x89  ord('J') == 0xd1);
 
@@ -259,7 +259,6 @@
 
 # UTF8 range tests from Inaba Hiroto
 
-# Not working in EBCDIC as of 12674.
 ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
 is($a, v192.196.172.194.197.172,'UTF range');
 
@@ -272,6 +271,15 @@
 ($a = \x{0100}) =~ tr/\x00-\x{100}/X/;
 is($a, X);
 
+($a = \x{0100}) =~ tr/\x00-\x{101}/X/;
+is($a, X);
+
+($a = \x{0100}\x{0101}) =~ tr/\x00-\x{102}/X/;
+is($a, XX);
+
+($a = \x{0101}\x{0102}) =~ tr/\x00-\x{103}/X/;
+is($a, XX);
+
 ($a = \x{0100}) =~ tr/\x{}-\x{00ff}/X/c;
 is($a, X);
 
@@ -303,8 +311,16 @@
 is($c, 8);
 is($a, );
 
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}\x89-\x91/X/;
+is($c, 8);
+is($a, );
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}\xc9-\xd1/X/;
+is($c, 8);
+is($a, );
+
 SKIP: {
-skip not EBCDIC, 4 unless $Is_EBCDIC;
+skip not EBCDIC, 12 unless $Is_EBCDIC;
 
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j/X/;
 is($c, 2);
@@ -313,7 +329,38 @@
 $c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J/X/;
 is($c, 2);
 is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
+
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{1000}i-j/X/;
+is($c, 2);
+is($a, X\x8a\x8b\x8c\x8d\x8f\x90X);
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/\x{1000}I-J/X/;
+is($c, 2);
+is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
+
+$c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/i-j\x{1000}/X/;
+is($c, 2);
+is($a, X\x8a\x8b\x8c\x8d\x8f\x90X);
+
+$c = ($a = \xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1) =~ tr/I-J\x{1000}/X/;
+is($c, 2);
+is($a, X\xca\xcb\xcc\xcd\xcf\xd0X);
 }
+
+($a = \xfc\xfd\xfe\xff) =~ tr/\x00-\xff/X/;
+is($a, );
+
+($a = \xfc\xfd\xfe\xff) =~ tr/\x{1000}\x00-\xff/X/;
+is($a, );
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\x{100}/X/;
+is($a, X);
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x00-\x{200}/X/;
+is($a, X);
+
+($a = \xfc\xfd\xfe\xff\x{100}) =~ tr/\x{1000}\x00-\xff/X/c;
+is($a, \xfc\xfd\xfe\xffX);
 
 ($a = \x{100}) =~ tr/\x00-\xff/X/c;
 is(ord($a), ord(X));
diff -ur [EMAIL PROTECTED]/toke.c [EMAIL PROTECTED]/toke.c
--- [EMAIL PROTECTED]/toke.cWed Sep 14 17:40:19 2005
+++ [EMAIL PROTECTED]/toke.cMon Sep 19 12:05:41 2005
@@ -1407,6 +1407,7 @@
 UV uv;
 #ifdef EBCDIC
 UV literal_endpoint = 0;
+bool native_range = TRUE; /* turned to FALSE if the first endpoint is 
Unicode */
 #endif
 
 const char *leaveit =  /* set of acceptably-backslashed characters */
@@ -1429,8 +1430,14 @@
I32 i;  /* current expanded character */
I32 min;/* first character in range */
I32 max;/* last character in range */
-
-   if (has_utf8) {
+#ifdef EBCDIC
+   UV  uvmax = 0;  /* last character above byte */
+#endif
+   if (has_utf8
+#ifdef EBCDIC
+!native_range
+#endif
+   ) {
char * const c = (char*)utf8_hop((U8*)d, -1);
char *e = d++;
while (e--  c)
@@ -1443,12 +1450,41 @@
}
 
i = d - SvPVX_const(sv);/* remember current 
offset */
+#ifdef EBCDIC
+   SvGROW(sv, SvLEN(sv) + (has_utf8
+   ? (512 - UTF_CONTINUATION_MARK + UNISKIP(0x100))
+   : 256));
+   /* how many two-byte within 0..255: 128 in UTF-8, 96 in 
UTF-8-mod */
+#else
SvGROW(sv, SvLEN(sv) + 256);/* never more than 256 chars in 
a range */
+#endif
d = SvPVX(sv) + i;  /* refresh d after realloc */
-   d -= 2; /* eat the first char and the - 
*/
 
+#ifdef EBCDIC
+   if (has_utf8) {
+   

Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-15 Thread Sastry
Hi Sadahiro

Having incorporated the changes in doop.c and op.c,
I strangely get lots of failures, and here are the test results. It seems
that the first approach itself fails on tr//, and there will certainly be
more failures when we run the entire test suite, which uses these
functions.
   In the second approach, the change seems to
affect only tr//. Please let me know your suggestions for the
changes which I can apply in S_scan_const(), and I will see if it works.

regards
Sastry


# Failed at t/op/tr.t line 110
#  got 'š\''
Wide character in print at ./test.pl line 48.
# expected '΋\''
# Failed at t/op/tr.t line 209
Wide character in print at ./test.pl line 48.
#  got '¯œD–㯜D–ã'
Wide character in print at ./test.pl line 48.
# expected '¯œ¯Û–㯜¯Û–ã'
# Failed at t/op/tr.t line 219
#  got 'CDÚCDÚ'
Wide character in print at ./test.pl line 48.
# expected 'C¯Û–ãC¯Û–ã'
# Failed at t/op/tr.t line 224
Wide character in print at ./test.pl line 48.
#  got 'ED–ãED–㌨Føã'
Wide character in print at ./test.pl line 48.
# expected 'E¯Û[E¯Û[Œ¨Føã'
# Failed at t/op/tr.t line 234
Wide character in print at ./test.pl line 48.
#  got '¯Û¯Û¯Û¯Û¯Û¯Û'
Wide character in print at ./test.pl line 48.
# expected '¯ÛD¯Û¯ÛD¯Û'
# Failed at t/op/tr.t line 283
Wide character in print at ./test.pl line 48.
#  got '¯œD–㯥E–ã'
Wide character in print at ./test.pl line 48.
# expected '¯œ¯œ–㯥¯Û–ã'
# Failed at t/op/tr.t line 350
#  got '§ÿ'
Wide character in print at ./test.pl line 48.
# expected 'ΰÎ'
1..99
ok 1 - uc
ok 2 - lc
ok 3 - partial uc
ok 4 - EBCDIC discontinuity
ok 5 - tr cancels IOK and NOK
ok 6 - harmless if explicitly not updating
ok 7 - harmless if implicitly not updating
ok 8 - no error
ok 9 - handles UTF8
ok 10
ok 11
ok 12
ok 13
ok 14
ok 15
ok 16
ok 17 - changing UTF8 chars in a UTF8 string, same length
ok 18
ok 19 - more bytes
ok 20
not ok 21 - Putting UT8 chars into a non-UTF8 string
ok 22
ok 23 - Removing UTF8 chars from UTF8 string
ok 24
ok 25 - Counting UTF8 chars in UTF8 string
ok 26 -  non-UTF8 chars in UTF8 string
ok 27 -  UTF8 chars in non-UTFs string
ok 28 - tr/a-z-9//
ok 29 - hyphens, leading
ok 30 -trailing
ok 31 -both
ok 32
ok 33
ok 34
ok 35 - reversed range check
ok 36 - cannot update read-only var
ok 37 - explicit read-only count
ok 38 - no error
ok 39 - implicit read-only count
ok 40 - no error
ok 41 - LHS of non-updating tr
ok 42 - LHS bad on updating tr
ok 43 - byte2byte transliteration
ok 44
ok 45
ok 46
not ok 47 - byte2wide transliteration
ok 48 -wide2byte
ok 49 -wide2wide
not ok 50 - byte2wide  wide2byte
not ok 51 - all together now!
ok 52 - transliterate and count
ok 53
not ok 54 - translit w/complement
ok 55
ok 56 - translit w/deletion
ok 57
ok 58 - translit w/squeeze
ok 59
ok 60
ok 61
ok 62
ok 63 - UTF range
not ok 64
ok 65
ok 66
ok 67
ok 68
ok 69
ok 70
ok 71
ok 72
ok 73
ok 74
ok 75
ok 76
ok 77
ok 78
ok 79
ok 80
ok 81
ok 82
not ok 83
ok 84
ok 85
ok 86
ok 87
ok 88 - pp_trans needs to unshare shared hash keys
ok 89 -no error
ok 90 - implicit count on constant
ok 91 -no error
ok 92 - implicit count outside array bounds, index negative
ok 93 - doesn't extend the array
ok 94 - implicit count outside array bounds, index positive
ok 95 - doesn't extend the array
ok 96 - implicit count outside hash bounds
ok 97 - doesn't extend the hash
ok 98 - non-modifying tr/// on a scalar ref
ok 99 - doesn't stringify its argument





Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-14 Thread SADAHIRO Tomoyuki

On Wed, 14 Sep 2005 16:50:26 +0530, Sastry [EMAIL PROTECTED] wrote

 Hi Sadahiro
 
 On 9/12/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
  
  I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to
  such an ambiguity of \xc0-\xc4. In this expression the left part
  \x{12c}-\x{130} parsed before coerces \xc0-\xc4 into Unicode,
  and results in the failure.
 So this is still a problem on EBCDIC! Is there a way to fix this?

  #test case B # On ASCII platform, of course successful
  $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{100}\x89-\x91/X/;
  is($c, 8);
  is($a, );
 This test fails on EBCDIC.  In S_scan_const(), there is a statement below.
 /* Insert oct or hex escaped character.
* There will always enough room in sv since such
* escapes will be longer than any UTF-8 sequence
* they can end up as. */
   
   /* We need to map to chars to ASCII before doing the tests
  to cover EBCDIC
   */
   if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) {
       if (!has_utf8 && uv > 255) {
 
 on an ASCII , the first if condition is true as uv is 137  and it
 falls in the variant range as uv > \x7F whereas on EBCDIC the if
 condition is false. Can you explain why this behaviour is?

See the else branch for this if. The condition tests whether uv needs
multiple octets in UTF-8/UTF-EBCDIC or only a single octet.
\x89 in Latin-1 corresponds to a double-octet representation
in UTF-8, so the condition is true (multiple octets are needed) on an
ASCII platform. \x89 in EBCDIC corresponds to a single-octet representation
in UTF-EBCDIC, so the condition is false on an EBCDIC platform.

Where the else branch runs, there is no difference between ASCII and UTF-8,
or between single-octet EBCDIC and UTF-EBCDIC.
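
To illustrate the single-octet versus multiple-octet test on the ASCII side, here is a minimal C sketch. It is not the perl source: utf8_octets() is a hypothetical helper that models plain UTF-8 lengths only and does not model UTF-EBCDIC. On an ASCII platform native \x89 is U+0089, while on EBCDIC native \x89 is 'i', whose Unicode value is U+0069.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a perl API: number of octets that plain
 * UTF-8 needs for a Unicode code point up to U+10FFFF. */
size_t utf8_octets(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL)    return 1;   /* single octet, same as ASCII */
    if (cp < 0x800UL)   return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* 1 if the code point fits in a single UTF-8 octet, else 0. */
int is_single_octet(unsigned long cp)
{
    return utf8_octets(cp) == 1;
}
```

With these definitions utf8_octets(0x89) is 2 and utf8_octets(0x69) is 1, which corresponds to the true/false difference described above.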

 Also I found that the characters are expanded during runtime in
 S_do_trans_simple_utf8()

If I understand it correctly, expansion of character ranges isn't
performed in do_trans_simple_utf8(). It is performed in scan_const()
for non-Unicode and pmtrans() for Unicode.

 Do you have any suggestion where the problem is?

(1) One way (I think worse)
Perl should treat the range in the native order (not in the Unicode one)
throughout parse time, compile time, and run time,

using uvchr_to_utf8() instead of uvuni_to_utf8(),
  utf8n_to_uvchr() instead of utf8n_to_uvuni(),
in op.c#pmtrans, doop.c#do_trans_simple_utf8, etc.

But swash_fetch() would also need to change (the current swash does not
know EBCDIC, only Unicode), and changes to swash may break
lc(), uc(), regular expression \p{something}, etc.

(2) Another way (I think better)
No change to swash, pmtrans, or do_trans_.

Instead, all character ranges within 0..255 (not only for non-Unicode
but also for Unicode) would be expanded in scan_const()
(and pmtrans() would expand only uv >= 256).

I think this way requires changing only toke.c#scan_const
and influences only tr///.

But the change will be quite big, since the current scan_const()
only expands non-Unicode ranges and assumes a single-octet encoding.
The range 0..255 in UTF-8/UTF-EBCDIC includes double-octet characters.

I'm not sure whether such a change should be enclosed
in #ifdef EBCDIC and #endif.
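
The idea behind (2) can be sketched in C as follows. This is only an illustration under stated assumptions, not the proposed toke.c change: the function names are hypothetical, and the small table holds what I understand to be the code page 037 Unicode values for native bytes 0x89..0x91 only. The range is walked in native byte order and each byte is emitted as its Unicode value, so the result is an explicit list that no longer needs to be expanded later in Unicode order.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed code page 037 values for native bytes 0x89..0x91 only:
 * i, «, », ð, ý, þ, ±, °, j. */
static const unsigned long cp037_89_to_91[9] = {
    0x69, 0xAB, 0xBB, 0xF0, 0xFD, 0xFE, 0xB1, 0xB0, 0x6A
};

/* Hypothetical helper: Unicode value of a native byte,
 * or -1 if the byte is outside this toy table. */
long toy_native_to_unicode(unsigned int b)
{
    if (b < 0x89 || b > 0x91)
        return -1;
    return (long)cp037_89_to_91[b - 0x89];
}

/* Expand the native range lo..hi into out[] as Unicode code points,
 * keeping native order. Returns the number of entries written, or 0
 * if the range is reversed, leaves the toy table, or exceeds cap. */
size_t expand_native_range(unsigned int lo, unsigned int hi,
                           unsigned long *out, size_t cap)
{
    size_t n = 0;
    unsigned int b;

    assert(out != NULL);
    if (lo > hi)
        return 0;
    for (b = lo; b <= hi; b++) {
        long cp = toy_native_to_unicode(b);
        if (cp < 0 || n == cap)
            return 0;
        out[n++] = (unsigned long)cp;
    }
    return n;
}

/* Convenience for spot checks: entry idx of the expansion, or 0. */
unsigned long expanded_at(unsigned int lo, unsigned int hi, size_t idx)
{
    unsigned long buf[16];
    size_t n = expand_native_range(lo, hi, buf, 16);
    return idx < n ? buf[idx] : 0;
}
```

Note that the expanded list is not monotonic in Unicode order (it starts at U+0069, jumps above U+00A0, and ends at U+006A), which is why the expansion has to happen while the native order is still known.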

Regards,
SADAHIRO Tomoyuki




Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-14 Thread Sastry
Hi Sadahiro

On 9/12/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 
 On Mon, 12 Sep 2005 16:12:45 +0530, Sastry [EMAIL PROTECTED] wrote
 
  Hi Sadahiro
  
  
   On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
  
  
  Do you think that perl-5.8.6 is not expanding the character ranges with 
  Unicode? If so how is this test case working?
   ($a = \x{12d}\x{12e}\x{12f}\x{130}) =~ tr/\x{12c}-\x{130}/Y/;
  All the bytes are translated to Y
   regards
  -Sastry
 
 Beyond 255 (\x{ff}), I think it will be correctly expanded.
 \x{12c}-\x{130} is beyond 255, and thus no problem.
 
 In the range of 0..255 (inclusive), I think generally no for EBCDIC.
 (Why I don't say always no is that there are some cases where
  a character range in EBCDIC coincides with that in Unicode:
  for example 0-9 can be successfully expanded into 0123456789
  in both encodings)
 
 I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to
 such an ambiguity of \xc0-\xc4. In this expression the left part
 \x{12c}-\x{130} parsed before coerces \xc0-\xc4 into Unicode,
 and results in the failure.
So this is still a problem on EBCDIC! Is there a way to fix this?

 
 In contrast, I attribute the success in tr/\xc0-\xc4/\x{12c}-\x{130}/;
 to that \xc0-\xc4 is parsed before \x{12c}-\x{130}, and then
 \xc0-\xc4 is expanded into \xc0\xc1\xc2\xc3\xc4 as EBCDIC
 before the character list is coerced into Unicode.
 
 
 Well, how about the tese case B? (It has \x{100} at first and
 then both sides are coerced into Unicode.)
 
 #test case A # now resolved
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x89-\x91/X/;
 is($c, 8);
 is($a, );
 
 #test case B # On ASCII platform, of course successful
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x{100}\x89-\x91/X/;
 is($c, 8);
 is($a, );
This test fails on EBCDIC. In S_scan_const(), there is the following code:
/* Insert oct or hex escaped character.
 * There will always enough room in sv since such
 * escapes will be longer than any UTF-8 sequence
 * they can end up as. */

/* We need to map to chars to ASCII before doing the tests
   to cover EBCDIC
*/
if (!UNI_IS_INVARIANT(NATIVE_TO_UNI(uv))) {
    if (!has_utf8 && uv > 255) {

On an ASCII platform, the first if condition is true, since uv is 137 and
falls in the variant range (uv > \x7F), whereas on EBCDIC the
condition is false. Can you explain why it behaves this way?
I also found that the characters are expanded at run time in
S_do_trans_simple_utf8().
Do you have any suggestion as to where the problem is?

 
 I think the current perl on EBCDIC does not translate gap characters
 for the test case B, which works like tr/\x{100}i-j/X/
 and results in $c == 2, and $a eq X\x8a\x8b\x8c\x8d\x8f\x90X;
 because i's next character is j in Unicode.
It expands the range but doesn't translate.

 
 And then try this:
 #test case C # On ASCII platform, of course successful
 $c = ($a = \x89\x8a\x8b\x8c\x8d\x8f\x90\x91) =~ tr/\x89-\x91\x{100}/X/;
 is($c, 8);
 is($a, );
This works fine

 
 I think the test case C would success even on EBCDIC, because
 the expansion from \x89-\x91 to \x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91
 will be done before the parser finds \x{100}.
 

 Regards,
 SADAHIRO Tomoyuki
 
 
 

regards
Sastry
--


Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-12 Thread Sastry
Hi Sadahiro


 On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:

 
 On Wed, 31 Aug 2005 19:53:37 +0530, Sastry [EMAIL PROTECTED] wrote
 
  Hi Sadahiro
  The patch has resolved four tests that were failing previously but one
  more test is stilling failing(which was failing even before applying the
  patch).
  Here is the test case
 
  ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
  is($a, v192.196.172.194.197.172, 'UTF range');
  # got 'DÐDEÐ'
  # expected '{DÐBEÐ'
  Can you suggest some pointers towards fixing this?
  -Sastry
 
 This EBCDIC-specific problem is based on how to treat with code values
 including Unicode (\x{12c}-\x{130} is surely Unicode) on EBCDIC platform.
 Native code values in EBCDIC (for example 'A' == 193) almost differs
 from the range of 0..255 in Unicode (for example 'A' == 65) which
 coincides with ASCII/Latin1.
 
 Thus the middle part of a character range is gererally different
 between EBCDIC and Unicode.
 
 For example consider a character range \xc0-\xc4. Since the mappings
 \xc0 to '{' (an open curly) and \xc4 to D in EBCDIC are definite,
 the range \xc0-\xc4 is equivalent to {-D on EBCDIC platform.
 
 In EBCDIC {-D (\xc0-\xc4) can be expanded to \xc0\xc1\xc2\xc3\xc4,
 but in Unicode {-D cannot be expanded, as the Unicode scalar values
 of the endpoints are reverse ('{' = U+007B, D = U+0044).

  
 Actually the current perl implementation is confused:
 in the parse time (see toke.c#scan_const) perl treats the range
 in EBCDIC order and then does not catch as Invalid range,
 though in the compile time (see op.c#pmtrans) and the run time
 (see doop.c#do_trans_simple_utf8 and its friends) perl treats
 the range in Unicode order and then generates a strange result.
For this test, since min > max in scan_const as per their Unicode
values, should we issue a warning? In that case the test case is wrong on
the EBCDIC platform! Am I correct?

  
 In my opinion it is necessary to determine how to expand character
 ranges with Unicode (whether the native EBCDIC order or Unicode order).
 I'm not sure using the native encoding (ASCII/Latin1/EBCDIC) everytime
 (that is same as no Unicode within 0..255) makes people happy.

Do you think that perl-5.8.6 is not expanding character ranges with
Unicode? If so, how is this test case working?
($a = "\x{12d}\x{12e}\x{12f}\x{130}") =~ tr/\x{12c}-\x{130}/Y/;
All the characters are translated to Y.
 regards
-Sastry

 
 Regards,
 SADAHIRO Tomoyuki
 
 



Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-12 Thread SADAHIRO Tomoyuki

On Mon, 12 Sep 2005 16:12:45 +0530, Sastry [EMAIL PROTECTED] wrote

 Hi Sadahiro
 
 
  On 9/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 
 
 Do you think that perl-5.8.6 is not expanding the character ranges with 
 Unicode? If so how is this test case working?
  ($a = \x{12d}\x{12e}\x{12f}\x{130}) =~ tr/\x{12c}-\x{130}/Y/;
 All the bytes are translated to Y
  regards
 -Sastry

Beyond 255 (\x{ff}), I think it will be expanded correctly.
\x{12c}-\x{130} is beyond 255, and thus no problem.

Within the range 0..255 (inclusive), I think the answer is generally no for EBCDIC.
(The reason I don't say "always no" is that there are some cases where
 a character range in EBCDIC coincides with that in Unicode:
 for example, 0-9 can be successfully expanded into 0123456789
 in both encodings.)
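
As a small illustration of when a native range coincides with the Unicode one, here is a C sketch. The map is hypothetical and deliberately partial: it covers only the EBCDIC digits (0xF0..0xF9) and the lowercase runs a..i (0x81..0x89) and j..r (0x91..0x99), and treats every other byte as unknown. A range "coincides" here when stepping through the native bytes also steps through consecutive Unicode values.

```c
#include <assert.h>

/* Hypothetical, partial EBCDIC-to-Unicode map: digits and the
 * lowercase runs a..i and j..r only; -1 means "not in this toy map". */
long toy_native_to_unicode(unsigned int b)
{
    if (b >= 0xF0 && b <= 0xF9) return 0x30 + (long)(b - 0xF0); /* '0'..'9' */
    if (b >= 0x81 && b <= 0x89) return 0x61 + (long)(b - 0x81); /* 'a'..'i' */
    if (b >= 0x91 && b <= 0x99) return 0x6A + (long)(b - 0x91); /* 'j'..'r' */
    return -1;
}

/* 1 if every native byte in lo..hi maps to consecutive Unicode values
 * starting at the value of lo; 0 otherwise (unknown bytes count as 0). */
int range_coincides(unsigned int lo, unsigned int hi)
{
    long first = toy_native_to_unicode(lo);
    unsigned int b;

    assert(lo <= hi && hi <= 0xFF);
    if (first < 0)
        return 0;
    for (b = lo; b <= hi; b++) {
        if (toy_native_to_unicode(b) != first + (long)(b - lo))
            return 0;
    }
    return 1;
}
```

The digit range 0xF0..0xF9 passes this check, whereas 0x89..0x91 (native i through j) does not, because the bytes between the two letters are not the letters that follow i in Unicode.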

I attribute the failure in tr/\x{12c}-\x{130}/\xc0-\xc4/; to
the ambiguity of \xc0-\xc4. In this expression the left part,
\x{12c}-\x{130}, is parsed first and coerces \xc0-\xc4 into Unicode,
which results in the failure.

In contrast, I attribute the success in tr/\xc0-\xc4/\x{12c}-\x{130}/;
to the fact that \xc0-\xc4 is parsed before \x{12c}-\x{130}, so
\xc0-\xc4 is expanded into \xc0\xc1\xc2\xc3\xc4 as EBCDIC
before the character list is coerced into Unicode.


Well, how about test case B? (It has \x{100} first, so
both sides are coerced into Unicode.)

#test case A # now resolved
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

#test case B # On ASCII platform, of course successful
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x{100}\x89-\x91/X/;
is($c, 8);
is($a, "XXXXXXXX");

I think the current perl on EBCDIC does not translate the gap characters
in test case B, which works like tr/\x{100}i-j/X/
and results in $c == 2 and $a eq "X\x8a\x8b\x8c\x8d\x8f\x90X",
because the character after i is j in Unicode.
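
The difference between the two readings of \x89-\x91 can be counted with a short C sketch. The function names are hypothetical and the input bytes are the ones from test case B; the sketch only counts members and does not model perl's tr/// implementation.

```c
#include <assert.h>
#include <stddef.h>

/* The eight input bytes used in test case B (0x8e is not among them). */
static const unsigned char input[8] = {
    0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8f, 0x90, 0x91
};

/* Native-order reading: every byte value from lo to hi is a member. */
size_t count_native(unsigned int lo, unsigned int hi)
{
    size_t n = 0, k;

    assert(lo <= hi);
    for (k = 0; k < sizeof input; k++) {
        if (input[k] >= lo && input[k] <= hi)
            n++;
    }
    return n;
}

/* Unicode-order reading of the same endpoints: native 0x89 is 'i'
 * (U+0069) and native 0x91 is 'j' (U+006A). They are adjacent in
 * Unicode, so only the two endpoint bytes are members. */
size_t count_unicode_i_to_j(void)
{
    size_t n = 0, k;

    for (k = 0; k < sizeof input; k++) {
        if (input[k] == 0x89 || input[k] == 0x91)
            n++;
    }
    return n;
}
```

Here count_native(0x89, 0x91) gives 8, the count the test expects, while count_unicode_i_to_j() gives 2, the count described above for the current behaviour on EBCDIC.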

And then try this:
#test case C # On ASCII platform, of course successful
$c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91\x{100}/X/;
is($c, 8);
is($a, "XXXXXXXX");

I think test case C would succeed even on EBCDIC, because
the expansion from \x89-\x91 to \x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91
will be done before the parser finds \x{100}.

Regards,
SADAHIRO Tomoyuki




Re: Transliteration operator(tr//)on EBCDIC platform

2005-09-10 Thread SADAHIRO Tomoyuki

On Wed, 31 Aug 2005 19:53:37 +0530, Sastry [EMAIL PROTECTED] wrote

 Hi Sadahiro
   The patch has resolved four tests that were failing previously but one 
 more test is stilling failing(which was failing even before applying the 
 patch).
  Here is the test case
  
 ($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
 is($a, v192.196.172.194.197.172, 'UTF range');
  # got 'DÐDEÐ'
 # expected '{DÐBEÐ'
  Can you suggest some pointers towards fixing this?
  -Sastry

This EBCDIC-specific problem comes down to how code values, including
Unicode ones (\x{12c}-\x{130} is surely Unicode), are treated on an EBCDIC platform.
Native code values in EBCDIC (for example 'A' == 193) mostly differ
from those in the range 0..255 in Unicode (for example 'A' == 65), which
coincides with ASCII/Latin1.

Thus the middle part of a character range is generally different
between EBCDIC and Unicode.

For example, consider the character range \xc0-\xc4. Since the mappings
of \xc0 to '{' (an open curly) and \xc4 to D in EBCDIC are definite,
the range \xc0-\xc4 is equivalent to {-D on an EBCDIC platform.

In EBCDIC {-D (\xc0-\xc4) can be expanded to \xc0\xc1\xc2\xc3\xc4,
but in Unicode {-D cannot be expanded, as the Unicode scalar values
of the endpoints are in reverse order ('{' = U+007B, D = U+0044).
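
The reversed endpoints can be checked with a minimal C sketch. The helper names are hypothetical, and the table covers only native bytes 0xC0..0xC4 ('{' and 'A' through 'D'), using the values given above.

```c
#include <assert.h>

/* Unicode values for native EBCDIC bytes 0xC0..0xC4: '{', 'A'..'D'. */
static const unsigned long uni_c0_c4[5] = { 0x7B, 0x41, 0x42, 0x43, 0x44 };

/* Hypothetical helper: valid only for bytes 0xC0..0xC4. */
unsigned long toy_to_unicode(unsigned int b)
{
    assert(b >= 0xC0 && b <= 0xC4);
    return uni_c0_c4[b - 0xC0];
}

/* A range is usable in native order when its byte endpoints ascend. */
int valid_in_native_order(unsigned int lo, unsigned int hi)
{
    return lo <= hi;
}

/* The same range is usable in Unicode order only when the Unicode
 * values of its endpoints ascend. */
int valid_in_unicode_order(unsigned int lo, unsigned int hi)
{
    return toy_to_unicode(lo) <= toy_to_unicode(hi);
}
```

For the endpoints 0xC0 and 0xC4 the native check succeeds and the Unicode check fails, since U+007B is greater than U+0044.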

Actually the current perl implementation is confused:
at parse time (see toke.c#scan_const) perl treats the range
in EBCDIC order and so does not catch it as an "Invalid range",
while at compile time (see op.c#pmtrans) and run time
(see doop.c#do_trans_simple_utf8 and its friends) perl treats
the range in Unicode order and so generates a strange result.

In my opinion it is necessary to decide how to expand character
ranges with Unicode (in the native EBCDIC order or in Unicode order).
I'm not sure that always using the native encoding (ASCII/Latin1/EBCDIC)
(which is the same as having no Unicode within 0..255) would make people happy.

Regards,
SADAHIRO Tomoyuki




Re: Fw: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-31 Thread Sastry
Hi Sadahiro
  The patch has resolved four tests that were failing previously, but one 
more test is still failing (it was failing even before applying the 
patch).
 Here is the test case
 
($a = v300.196.172.302.197.172) =~ tr/\x{12c}-\x{130}/\xc0-\xc4/;
is($a, v192.196.172.194.197.172, 'UTF range');
 # got 'DÐDEÐ'
# expected '{DÐBEÐ'
 Can you suggest some pointers towards fixing this?
 -Sastry

Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-18 Thread Rafael Garcia-Suarez
SADAHIRO Tomoyuki wrote:
 perl5 porters,
 
 There is a response in approval from Sastry to my proposed patch.
 I'll forward it and now submit the proposal (on my prev mail) to p5p.

Thanks, applied as change #25303 to bleadperl.


Fw: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-16 Thread SADAHIRO Tomoyuki
perl5 porters,

There is a response in approval from Sastry to my proposed patch.
I'll forward it and now submit the proposal (on my prev mail) to p5p.

Regards,
SADAHIRO Tomoyuki

Forwarded by SADAHIRO Tomoyuki [EMAIL PROTECTED]
--- Original Message ---
 From:Sastry [EMAIL PROTECTED]
 To:  SADAHIRO Tomoyuki [EMAIL PROTECTED]
 Date:Tue, 16 Aug 2005 15:27:45 +0530
 Subject: Re: [PATCH] Re: Transliteration operator(tr//)on EBCDIC platform


Hi 
The patch works now as expected.

Thanks
-Sastry


On 8/11/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 
 On Wed, 10 Aug 2005 23:56:31 -0700 (PDT), rajarshi das [EMAIL PROTECTED] 
 wrote
 
  Hi,
  This is Rajarshi expressing Sastry's viewpoints since he's on vacation.
 
  SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 
  According to the above statement in perlebcdic.pod,
  s/[\x89-\x91]/X/g must substitute \x8e with X.
  But it doesn't concern whether tr/\x89-\x91/X/ would substitute \x8e
  with X or not, since tr/// does not use brackets, [ ].
 
  Though I think ranges in [ ] and ranges in tr/// should coincide
  and agree that tr/\x89-\x91/X/ should substitute \x8e with X,
  that is just my opinion.
  I don't know whether it is true and correct.
  Is there some way we can confirm whether this is correct (and expected
  behaviour), since there isn't any explicit documentation for the tr operator?
 
 Since t/op/tr.t already has a test case (cf. Change 9038), which Sastry
 previously pointed out is failing on EBCDIC platforms, I assume that at
 least the then pumpking thought this behaviour to be correct.
 
  By the way, when you say "If I specify [\x89-\x91]", does it
  mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.
  We mean tr/\x89-\x91/X/.
 
 
  We are first informed by you that gapped characters are not
  substituted with X by tr/\x89-\x91/X/.
  And you said s/[\x89-\x91]/X/g substituted all the characters
  including gapped characters with X, hadn't you?
 
  Yes.
  If so, I assume that your [\x89-\x91], which doesn't match any of
  the gapped characters, means tr/\x89-\x91/X/.
  That's correct. We mean tr/\x89-\x91/X/.
 
 
  The following is a part of the current core tests from op/pat.t.
  I believe they should be passed.
  Yes all the following tests pass. I think the following tests are in the 
  context of the
  s/[]/X/ operator and hence pass.
 
  Thanks,
 
  Rajarshi.
 
 OK. To me, this confirms that s/[]/X/ is fine and tr/// has a problem.
 Since I don't have any EBCDIC machine, I can't be sure that the following
 patch really makes sense.
 
 Regards,
 SADAHIRO Tomoyuki
 
 ! t/op/tr.t, toke.c
 
 diff -ur perl~/t/op/tr.t perl/t/op/tr.t
 --- perl~/t/op/tr.t Mon Aug 01 17:17:24 2005
 +++ perl/t/op/tr.t  Thu Aug 11 23:41:22 2005
 @@ -295,18 +295,15 @@
  # (i-j, r-s, I-J, R-S), [\x89-\x91] [\xc9-\xd1] has to match them,
  # from Karsten Sperling.
 
 -# Not working in EBCDIC as of 12674.
  $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
  is($c, 8);
  is($a, "XXXXXXXX");
 -
 -# Not working in EBCDIC as of 12674.
 +
  $c = ($a = "\xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1") =~ tr/\xc9-\xd1/X/;
  is($c, 8);
  is($a, "XXXXXXXX");
 
 -
 -SKIP: {
 +SKIP: {
      skip "not EBCDIC", 4 unless $Is_EBCDIC;
 
      $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/i-j/X/;
 diff -ur perl~/toke.c perl/toke.c
 --- perl~/toke.c    Mon Jul 18 04:31:02 2005
 +++ perl/toke.c Thu Aug 11 22:55:18 2005
 @@ -1368,6 +1368,9 @@
      I32  has_utf8 = FALSE;  /* Output constant is UTF8 */
      I32  this_utf8 = UTF;   /* The source string is assumed to be UTF8 */
      UV uv;
 +#ifdef EBCDIC
 +    UV literal_endpoint = 0;
 +#endif
  
      const char *leaveit =   /* set of acceptably-backslashed characters */
          PL_lex_inpat
 @@ -1417,8 +1420,9 @@
          }
  
  #ifdef EBCDIC
 -        if ((isLOWER(min) && isLOWER(max)) ||
 -            (isUPPER(min) && isUPPER(max))) {
 +        if (literal_endpoint == 2 &&
 +            ((isLOWER(min) && isLOWER(max)) ||
 +             (isUPPER(min) && isUPPER(max)))) {
              if (isLOWER(min)) {
                  for (i = min; i <= max; i++)
                      if (isLOWER(i))
 @@ -1437,6 +1441,9 @@
          /* mark the range as done, and continue */
          dorange = FALSE;
          didrange = TRUE;
 +#ifdef EBCDIC
 +        literal_endpoint = 0;
 +#endif
          continue;
      }
  
 @@ -1455,6 +1462,9 @@
          }
          else {
              didrange = FALSE;
 +#ifdef EBCDIC
 +            literal_endpoint = 0;
 +#endif
          }
      }
  
 @@ -1788,6 +1798,10 @@
          s++;
          continue;
      } /* end if (backslash) */
 +#ifdef EBCDIC
 +    else
 +        literal_endpoint++;
 +#endif
  
  default_action:
      /* If we started with encoded form, or already know we want it
 ###END OF PATCH

Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-11 Thread rajarshi das
Hi,
This is Rajarshi expressing Sastry's viewpoints since he's on vacation. 
 


SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:

On Wed, 10 Aug 2005 14:06:56 +0530, Sastry wrote
 
   As suggested by you, I ran the following script which resulted in
   substituting all the characters with X irrespective of the special
   case [i-j].
  
   ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
   is($a, "XXXXXXXX");

  +++quote begin
  REGULAR EXPRESSION DIFFERENCES
  As of perl 5.005_03 the letter range regular expression such as [A-Z]
  and [a-z] have been especially coded to not pick up gap characters.
  For example, characters such as o WITH CIRCUMFLEX that lie between I
  and J would not be matched by the regular expression range /[H-K]/.
  This works in the other direction, too, if either of the range end
  points is explicitly numeric: [\x89-\x91] will match \x8e, even though
  \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
  viewpoint.
 If I specify [\x89-\x91] it just matches the end characters (i,j)
 and doesn't match any of the gapped characters( including \x8e),
 unlike what you had mentioned.
 Is this correct? 
 -Sastry

According to the above statement in perlebcdic.pod,
s/[\x89-\x91]/X/g must substitute \x8e with X.
But it doesn't concern whether tr/\x89-\x91/X/ would substitute \x8e
with X or not, since tr/// does not use brackets, [ ].

Though I think ranges in [ ] and ranges in tr/// should coincide
and agree that tr/\x89-\x91/X/ should substitute \x8e with X,
that is just my opinion.
I don't know whether it is true and correct.
Is there some way we can confirm whether this is correct (and expected
behaviour), since there isn't any explicit documentation for the tr operator?


By the way, when you say "If I specify [\x89-\x91]", does it
mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.
We mean tr/\x89-\x91/X/.


We are first informed by you that gapped characters are not
substituted with X by tr/\x89-\x91/X/.
And you said s/[\x89-\x91]/X/g substituted all the characters
including gapped characters with X, hadn't you? 

Yes.
If so, I assume that your [\x89-\x91], which doesn't match any of
the gapped characters, means tr/\x89-\x91/X/.
That's correct. We mean tr/\x89-\x91/X/.


The following is a part of the current core tests from op/pat.t.
I believe they should be passed.
Yes, all the following tests pass. I think the following tests are in the
context of the s/[]/X/ operator and hence pass.

Thanks,

Rajarshi.


Regards,
SADAHIRO Tomoyuki

+++begin
# The 242 and 243 go with the 244 and 245.
# The trick is that in EBCDIC the explicit numeric range should match
# (as also in non-EBCDIC) but the explicit alphabetic range should not match.

if ("\x8e" =~ /[\x89-\x91]/) {
    print "ok 242\n";
} else {
    print "not ok 242\n";
}

if ("\xce" =~ /[\xc9-\xd1]/) {
    print "ok 243\n";
} else {
    print "not ok 243\n";
}

# In most places these tests would succeed since \x8e does not
# in most character sets match 'i' or 'j' nor would \xce match
# 'I' or 'J', but strictly speaking these tests are here for
# the good of EBCDIC, so let's test these only there.
if (ord('i') == 0x89 && ord('J') == 0xd1) { # EBCDIC
    if ("\x8e" !~ /[i-j]/) {
        print "ok 244\n";
    } else {
        print "not ok 244\n";
    }
    if ("\xce" !~ /[I-J]/) {
        print "ok 245\n";
    } else {
        print "not ok 245\n";
    }
} else {
    for (244..245) {
        print "ok $_ # Skip: only in EBCDIC\n";
    }
}
---end









[PATCH] Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-11 Thread SADAHIRO Tomoyuki

On Wed, 10 Aug 2005 23:56:31 -0700 (PDT), rajarshi das [EMAIL PROTECTED] wrote

 Hi,
 This is Rajarshi expressing Sastry's viewpoints since he's on vacation. 
 
 SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 
 According to the above statement in perlebcdic.pod,
 s/[\x89-\x91]/X/g must substitute \x8e with X.
 But it doesn't concern whether tr/\x89-\x91/X/ would substitute \x8e
 with X or not, since tr/// does not use brackets, [ ].
 
 Though I think ranges in [ ] and ranges in tr/// should coincide
 and agree that tr/\x89-\x91/X/ should substitute \x8e with X,
 that is just my opinion.
 I don't know whether it is true and correct.
 Is there some way we can confirm whether this is correct (and expected
 behaviour), since there isn't any explicit documentation for the tr operator?

Since t/op/tr.t already has a test case (cf. Change 9038), which Sastry
previously pointed out is failing on EBCDIC platforms, I assume that at
least the then pumpking thought this behaviour to be correct.

 By the way, when you say "If I specify [\x89-\x91]", does it
 mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.
 We mean tr/\x89-\x91/X/.
 
 
 We are first informed by you that gapped characters are not
 substituted with X by tr/\x89-\x91/X/.
 And you said s/[\x89-\x91]/X/g substituted all the characters
 including gapped characters with X, hadn't you? 
 
 Yes.
 If so, I assume that your [\x89-\x91], which doesn't match any of
 the gapped characters, means tr/\x89-\x91/X/.
 That's correct. We mean tr/\x89-\x91/X/.
 
 
 The following is a part of the current core tests from op/pat.t.
 I believe they should be passed.
 Yes all the following tests pass. I think the following tests are in the 
 context of the 
 s/[]/X/ operator and hence pass. 
 
 Thanks,
 
 Rajarshi.

OK. To me, this confirms that s/[]/X/ is fine and tr/// has a problem.
Since I don't have any EBCDIC machine, I can't be sure that the following
patch really makes sense.

Regards,
SADAHIRO Tomoyuki

! t/op/tr.t, toke.c

diff -ur perl~/t/op/tr.t perl/t/op/tr.t
--- perl~/t/op/tr.t Mon Aug 01 17:17:24 2005
+++ perl/t/op/tr.t  Thu Aug 11 23:41:22 2005
@@ -295,18 +295,15 @@
 # (i-j, r-s, I-J, R-S), [\x89-\x91] [\xc9-\xd1] has to match them,
 # from Karsten Sperling.
 
-# Not working in EBCDIC as of 12674.
 $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
 is($c, 8);
 is($a, "XXXXXXXX");
-
-# Not working in EBCDIC as of 12674.
+
 $c = ($a = "\xc9\xca\xcb\xcc\xcd\xcf\xd0\xd1") =~ tr/\xc9-\xd1/X/;
 is($c, 8);
 is($a, "XXXXXXXX");
 
-
-SKIP: {
+SKIP: {
     skip "not EBCDIC", 4 unless $Is_EBCDIC;
 
     $c = ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/i-j/X/;
diff -ur perl~/toke.c perl/toke.c
--- perl~/toke.c    Mon Jul 18 04:31:02 2005
+++ perl/toke.c Thu Aug 11 22:55:18 2005
@@ -1368,6 +1368,9 @@
     I32  has_utf8 = FALSE;  /* Output constant is UTF8 */
     I32  this_utf8 = UTF;   /* The source string is assumed to be UTF8 */
     UV uv;
+#ifdef EBCDIC
+    UV literal_endpoint = 0;
+#endif
 
     const char *leaveit =   /* set of acceptably-backslashed characters */
         PL_lex_inpat
@@ -1417,8 +1420,9 @@
         }
 
 #ifdef EBCDIC
-        if ((isLOWER(min) && isLOWER(max)) ||
-            (isUPPER(min) && isUPPER(max))) {
+        if (literal_endpoint == 2 &&
+            ((isLOWER(min) && isLOWER(max)) ||
+             (isUPPER(min) && isUPPER(max)))) {
             if (isLOWER(min)) {
                 for (i = min; i <= max; i++)
                     if (isLOWER(i))
@@ -1437,6 +1441,9 @@
         /* mark the range as done, and continue */
         dorange = FALSE;
         didrange = TRUE;
+#ifdef EBCDIC
+        literal_endpoint = 0;
+#endif
         continue;
     }
 
@@ -1455,6 +1462,9 @@
         }
         else {
             didrange = FALSE;
+#ifdef EBCDIC
+            literal_endpoint = 0;
+#endif
         }
     }
 
@@ -1788,6 +1798,10 @@
         s++;
         continue;
     } /* end if (backslash) */
+#ifdef EBCDIC
+    else
+        literal_endpoint++;
+#endif
 
 default_action:
     /* If we started with encoded form, or already know we want it
###END OF PATCH




Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-10 Thread Sastry
On 8/9/05, SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
 Hello,
 
 On Tue, 9 Aug 2005 15:09:42 +0530, Sastry [EMAIL PROTECTED] wrote
  Hi
 
  As suggested by you, I ran the following script which resulted in
  substituting all the characters with X irrespective of the special
  case [i-j].
 
  ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
  is($a, "XXXXXXXX");
 
 Right, that behavior of ranges in character classes [ ] is expectable
 from literal_endpoint, which is introduced by Change 16556.
 
 cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556
 
  I have also observed that whenever there are any gapped characters eg:
  [r-s] as in the following script, it just translates 'r' and 's' to X
  alone!
 
  ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
  is($a, XX);
 
  a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
  not be included?
  b) Do you think there is a bug in the tr// implementation as a
  consequence of the above?
 
  -Sastry
 
 Answer for a) is mentioned in perlebcdic.pod.
 The last sentence (This works in...) seems to be added there
 in accompanied with Change 16556 as above.
 
 +++quote begin
 REGULAR EXPRESSION DIFFERENCES
 As of perl 5.005_03 the letter range regular expression such as [A-Z]
 and [a-z] have been especially coded to not pick up gap characters.
 For example, characters such as o WITH CIRCUMFLEX that lie between I
 and J would not be matched by the regular expression range /[H-K]/.
 This works in the other direction, too, if either of the range end
 points is explicitly numeric: [\x89-\x91] will match \x8e, even though
 \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
 viewpoint.
If I specify [\x89-\x91], it just matches the end characters (i, j)
and doesn't match any of the gapped characters (including \x8e),
unlike what you had mentioned.
Is this correct?
-Sastry

 quote end
 
 I'll give some additional explanations from the viewpoint
 of portability:
 a letter range [h-k] always means [hijk], even on EBCDIC platforms,
 but not [hi\x8A-\x90jk], because the string h is always the small
 letter 'h' whether its code value is 0x68 or 0x88;
 thus a numeric range [\x89-\x91] should always mean
 [\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91] even on EBCDIC platforms,
 but not [\x89\x91], because the string \x89 always stands for
 the code value 0x89 whether it encodes a certain C1 control character
 or the letter 'i'.
 
 b): In my opinion the above change in [  ] for regular expressions
 is an improvement and a similar change in tr/// is also advisable.
 
 The reason why I hesitate to use the word bug is based on
 the following statement on tr/// in perlop.pod, esp. the last sentence:
 
 +++quote begin
 Note also that the whole range idea is rather unportable between
 character sets--and even within character sets they may cause results
 you probably didn't expect. A sound principle is to use only ranges
 that begin from and end at either alphabets of equal case (a-e, A-E),
 or digits (0-4). Anything else is unsafe. If in doubt, spell out
 the character sets in full.
 quote end
 
 where numeric ranges such as \x89-\x91 are not declared
 to be safe, but to be unsafe.
 
 Regards,
 SADAHIRO Tomoyuki
 
 



Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-10 Thread SADAHIRO Tomoyuki

On Wed, 10 Aug 2005 14:06:56 +0530, Sastry [EMAIL PROTECTED] wrote
 
   As suggested by you, I ran the following script which resulted in
   substituting all the characters with X irrespective of the special
   case [i-j].
  
   ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
   is($a, "XXXXXXXX");

  +++quote begin
  REGULAR EXPRESSION DIFFERENCES
  As of perl 5.005_03 the letter range regular expression such as [A-Z]
  and [a-z] have been especially coded to not pick up gap characters.
  For example, characters such as o WITH CIRCUMFLEX that lie between I
  and J would not be matched by the regular expression range /[H-K]/.
  This works in the other direction, too, if either of the range end
  points is explicitly numeric: [\x89-\x91] will match \x8e, even though
  \x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
  viewpoint.
 If I specify  [\x89-\x91]  it just matches the end characters (i,j)
 and doesn't match any of the gapped characters( including \x8e),
 unlike what you had mentioned.
 Is this correct? 
 -Sastry

According to the above statement in perlebcdic.pod,
s/[\x89-\x91]/X/g must substitute \x8e with X.
But that statement says nothing about whether tr/\x89-\x91/X/ should
substitute \x8e with X, since tr/// does not use brackets, [ ].

Though I think ranges in [ ] and ranges in tr/// should coincide,
and I agree that tr/\x89-\x91/X/ should substitute \x8e with X,
that is just my opinion.
I don't know whether it is actually correct.

By the way, when you say "If I specify [\x89-\x91]", does it
mean s/[\x89-\x91]/X/g or tr/\x89-\x91/X/? I'm confused.

You first informed us that gapped characters are not
substituted with X by tr/\x89-\x91/X/.
And you said that s/[\x89-\x91]/X/g substituted all the characters,
including gapped characters, with X, didn't you?
If so, I assume that your [\x89-\x91], which doesn't match any of
the gapped characters, means tr/\x89-\x91/X/.

The following is a part of the current core tests from op/pat.t.
I believe they should pass.

Regards,
SADAHIRO Tomoyuki

+++begin
# The 242 and 243 go with the 244 and 245.
# The trick is that in EBCDIC the explicit numeric range should match
# (as also in non-EBCDIC) but the explicit alphabetic range should not match.

if ("\x8e" =~ /[\x89-\x91]/) {
    print "ok 242\n";
} else {
    print "not ok 242\n";
}

if ("\xce" =~ /[\xc9-\xd1]/) {
    print "ok 243\n";
} else {
    print "not ok 243\n";
}

# In most places these tests would succeed since \x8e does not
# in most character sets match 'i' or 'j' nor would \xce match
# 'I' or 'J', but strictly speaking these tests are here for
# the good of EBCDIC, so let's test these only there.
if (ord('i') == 0x89 && ord('J') == 0xd1) { # EBCDIC
    if ("\x8e" !~ /[i-j]/) {
        print "ok 244\n";
    } else {
        print "not ok 244\n";
    }
    if ("\xce" !~ /[I-J]/) {
        print "ok 245\n";
    } else {
        print "not ok 245\n";
    }
} else {
    for (244..245) {
        print "ok $_ # Skip: only in EBCDIC\n";
    }
}
---end







Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-09 Thread SADAHIRO Tomoyuki
Hello,

On Tue, 9 Aug 2005 15:09:42 +0530, Sastry [EMAIL PROTECTED] wrote
 Hi
 
 As suggested by you, I ran the following script which resulted in
 substituting all the characters with X irrespective of the special 
 case [i-j].

 ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
 is($a, "XXXXXXXX");

Right, that behavior of ranges in character classes [ ] is what one would
expect from literal_endpoint, which was introduced by Change 16556.

cf. http://public.activestate.com/cgi-bin/perlbrowse?patch=16556

 I have also observed that whenever there are any gapped characters eg:
 [r-s] as in the following script, it just translates 'r' and 's' to X 
 alone!
 
 ($a = "\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2") =~ tr/\x99-\xa2/X/;
 is($a, XX);

 a) Why is it mentioned that when [i-j] is included [\x89-\x91] should
 not be included?
 b) Do you think there is a bug in the tr// implementation as a
 consequence of the above?
 
 -Sastry 

The answer to a) is given in perlebcdic.pod.
The last sentence ("This works in...") seems to have been added there
along with Change 16556, mentioned above.

+++quote begin
REGULAR EXPRESSION DIFFERENCES
As of perl 5.005_03 the letter range regular expression such as [A-Z]
and [a-z] have been especially coded to not pick up gap characters.
For example, characters such as o WITH CIRCUMFLEX that lie between I
and J would not be matched by the regular expression range /[H-K]/.
This works in the other direction, too, if either of the range end
points is explicitly numeric: [\x89-\x91] will match \x8e, even though
\x89 is i and \x91 is j, and \x8e is a gap character from the alphabetic
viewpoint.
quote end

I'll give some additional explanations from the viewpoint
of portability:
a letter range [h-k] always means [hijk], even on EBCDIC platforms,
but not [hi\x8A-\x90jk], because the string h is always the small
letter 'h' whether its code value is 0x68 or 0x88;
thus a numeric range [\x89-\x91] should always mean
[\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91] even on EBCDIC platforms,
but not [\x89\x91], because the string \x89 always stands for
the code value 0x89 whether it encodes a certain C1 control character
or the letter 'i'.
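
The rule above can be sketched in C as follows. This is only an
illustration, not the code in regcomp.c: the names ebcdic_is_lower and
expand_lower_range are invented here, and the three lowercase runs
(a-i at 0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9) are hardcoded so
that the sketch also runs on a non-EBCDIC machine.

```c
#include <stddef.h>

/* Lowercase letters in the usual EBCDIC code pages occupy three
 * separate runs, with non-letters ("gap characters") in between. */
static int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

/* Expand lo-hi into out[] and return the number of bytes written.
 * If both endpoints were written as literal letters (like h-k) and are
 * lowercase, gap characters are skipped; if an endpoint was written
 * numerically (like \x89-\x91), every code value is kept.
 * Assumes lo <= hi <= 0xFF. */
static size_t expand_lower_range(unsigned lo, unsigned hi,
                                 int endpoints_literal,
                                 unsigned char *out, size_t cap)
{
    size_t n = 0;
    unsigned c;
    int letters_only = endpoints_literal &&
                       ebcdic_is_lower(lo) && ebcdic_is_lower(hi);

    for (c = lo; c <= hi && n < cap; c++) {
        if (letters_only && !ebcdic_is_lower(c))
            continue;               /* skip a gap character */
        out[n++] = (unsigned char)c;
    }
    return n;
}
```

With literal endpoints h-k (0x88-0x92) this gives the four letters hijk;
with numeric endpoints \x89-\x91 it gives all nine code values,
including \x8e.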

b): In my opinion the above change in [  ] for regular expressions
is an improvement and a similar change in tr/// is also advisable.

The reason I hesitate to use the word "bug" is the following
statement on tr/// in perlop.pod, esp. the last sentence:

+++quote begin
Note also that the whole range idea is rather unportable between
character sets--and even within character sets they may cause results
you probably didn't expect. A sound principle is to use only ranges
that begin from and end at either alphabets of equal case (a-e, A-E),
or digits (0-4). Anything else is unsafe. If in doubt, spell out
the character sets in full.
quote end

where numeric ranges such as \x89-\x91 are not declared
to be safe, but to be unsafe.

Regards,
SADAHIRO Tomoyuki




Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-08 Thread Nicholas Clark
On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
 Hi
 
 I am trying to run this script on an EBCDIC platform using perl-5.8.6
  
 ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
 is($a, "XXXXXXXX");
 
 
 The result I get is 
 
  'X«»ðý±°X'
 
 a) Is this happening  since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped
 characters in EBCDIC ?

I think so. In that \x89 is 'i' and \x91 is 'j'.


 b) Should all the bytes in $a change to X?

I don't know. It seems to be some special case code in regexec.c:

#ifdef EBCDIC
    /* In EBCDIC [\x89-\x91] should include
     * the \x8e but [i-j] should not. */
    if (literal_endpoint == 2 &&
        ((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
         (isUPPER(prevvalue) && isUPPER(ceilvalue))))
    {
        if (isLOWER(prevvalue)) {
            for (i = prevvalue; i <= ceilvalue; i++)
                if (isLOWER(i))
                    ANYOF_BITMAP_SET(ret, i);
        } else {
            for (i = prevvalue; i <= ceilvalue; i++)
                if (isUPPER(i))
                    ANYOF_BITMAP_SET(ret, i);
        }
    }
    else
#endif


which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.
I don't know where ranges in tr/// are parsed, but given that I grepped
for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91//
is treated as tr/i-j// and in turn i-j is treated as letters and always
special cased

I don't know if tr/i-j// and tr/\x89-\x91// should behave differently
(ie whether we currently have a bug)

Nicholas Clark


Re: Transliteration operator(tr//)on EBCDIC platform

2005-08-08 Thread SADAHIRO Tomoyuki

On Mon, 8 Aug 2005 15:36:40 +0100, Nicholas Clark [EMAIL PROTECTED] wrote

 On Thu, Aug 04, 2005 at 11:42:54AM +0530, Sastry wrote:
  Hi
  
  I am trying to run this script on an EBCDIC platform using perl-5.8.6
   
  ($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
  is($a, "XXXXXXXX");
  
  
  The result I get is 
  
   'X«»ðý±°X'
  
  a) Is this happening  since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped
  characters in EBCDIC ?
 
 I think so. In that \x89 is 'i' and \x91 is 'j'.
 
 
  b) Should all the bytes in $a change to X?
 
 I don't know. It seems to be some special case code in regexec.c:
 
 #ifdef EBCDIC
     /* In EBCDIC [\x89-\x91] should include
      * the \x8e but [i-j] should not. */
     if (literal_endpoint == 2 &&
         ((isLOWER(prevvalue) && isLOWER(ceilvalue)) ||
          (isUPPER(prevvalue) && isUPPER(ceilvalue))))
     {
         if (isLOWER(prevvalue)) {
             for (i = prevvalue; i <= ceilvalue; i++)
                 if (isLOWER(i))
                     ANYOF_BITMAP_SET(ret, i);
         } else {
             for (i = prevvalue; i <= ceilvalue; i++)
                 if (isUPPER(i))
                     ANYOF_BITMAP_SET(ret, i);
         }
     }
     else
 #endif
 
 
 which I assume is making [i-j] in a regexp leave a gap, but [\x89-\x91] not.
 I don't know where ranges in tr/// are parsed, but given that I grepped
 for EBCDIC and didn't find any analogous code, it looks like tr/\x89-\x91//
 is treated as tr/i-j// and in turn i-j is treated as letters and always
 special cased

S_scan_const() in toke.c seems to expand ranges in tr///,
while S_regclass() in regcomp.c (what I assume you mean) copes
with those in []. 

 from toke.c, line 1419
#ifdef EBCDIC
    if ((isLOWER(min) && isLOWER(max)) ||
        (isUPPER(min) && isUPPER(max))) {
        if (isLOWER(min)) {
            for (i = min; i <= max; i++)
                if (isLOWER(i))
                    *d++ = NATIVE_TO_NEED(has_utf8,i);
        } else {
            for (i = min; i <= max; i++)
                if (isUPPER(i))
                    *d++ = NATIVE_TO_NEED(has_utf8,i);
        }
    }
    else
#endif

The former has nothing like the literal_endpoint used in the latter;
thus tr/// does not seem to tell literal endpoints from escaped (numeric)
ones in ranges, and tr/\x89-\x91/X/ will not replace \x8e on EBCDIC.
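
To make the consequence concrete, here is a small C sketch (an
illustration only, not perl's implementation) that replaces the bytes
of a buffer falling in a range with X, with and without gap skipping.
The names ebcdic_is_lower and tr_range_to_x are invented for this
example, and the EBCDIC lowercase runs are hardcoded so that it can be
tried on any machine.

```c
#include <stddef.h>

/* Lowercase letters in the usual EBCDIC code pages: a-i, j-r, s-z. */
static int ebcdic_is_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

/* Replace every byte of buf[0..len) that lies in lo-hi with 'X' and
 * return the number of replacements, like the count tr/// returns.
 * With skip_gaps set and both endpoints lowercase letters, non-letters
 * inside the range are left untouched (the behaviour reported for
 * tr/\x89-\x91/X/); with skip_gaps clear, the whole numeric range is
 * replaced. */
static size_t tr_range_to_x(unsigned char *buf, size_t len,
                            unsigned lo, unsigned hi, int skip_gaps)
{
    size_t i, count = 0;
    int letters_only = skip_gaps &&
                       ebcdic_is_lower(lo) && ebcdic_is_lower(hi);

    for (i = 0; i < len; i++) {
        unsigned c = buf[i];
        if (c < lo || c > hi)
            continue;               /* outside the range */
        if (letters_only && !ebcdic_is_lower(c))
            continue;               /* gap character, left alone */
        buf[i] = 'X';
        count++;
    }
    return count;
}
```

For the eight bytes \x89\x8a\x8b\x8c\x8d\x8f\x90\x91 and the range
\x89-\x91, gap skipping replaces only the first and last byte (count 2),
while treating the range numerically replaces all eight.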

Hmm, this may be an inconsistency specific to EBCDIC.
Sastry, would you please run the following snippet on your EBCDIC machine?

($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ s/[\x89-\x91]/X/g;
is($a, "XXXXXXXX");

Does it behave the same way as yours?
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
is($a, "XXXXXXXX");

Regards,
SADAHIRO Tomoyuki




Transliteration operator(tr//)on EBCDIC platform

2005-08-04 Thread Sastry
Hi

I am trying to run this script on an EBCDIC platform using perl-5.8.6
 
($a = "\x89\x8a\x8b\x8c\x8d\x8f\x90\x91") =~ tr/\x89-\x91/X/;
is($a, "XXXXXXXX");


The result I get is 

 'X«»ðý±°X'

a) Is this happening  since \x8a\x8b\x8c\x8d\x8f\x90 are the gapped
characters in EBCDIC ?
or 
b) Should all the bytes in $a change to X?

Thanks in advance
Sastry