I'm having trouble using the Encode module in Perl 5.8.0 with the MacArabic code table. (It took a long time to figure out the cause, and I still don't understand where Encode gets or keeps its information about character mappings.)
The problem affects all the points in the MacArabic table whose Unicode correlates include the "<LR>+" or "<RL>+" indicators -- e.g. (quoting from the MAC/ARABIC.TXT listing available from www.unicode.org):

#=======================================================================
#   FTP file name:  ARABIC.TXT
#
#   Contents:       Map (external version) from Mac OS Arabic
#                   character set to Unicode 2.1
#
#   Copyright:      (c) 1994-1999 by Apple Computer, Inc., all rights
#                   reserved.
...
0x20    <LR>+0x0020     # SPACE, left-right
0x21    <LR>+0x0021     # EXCLAMATION MARK, left-right
0x22    <LR>+0x0022     # QUOTATION MARK, left-right
...
0x81    <RL>+0x00A0     # NO-BREAK SPACE, right-left
...
0x8C    <RL>+0x00AB     # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, right-left
...
0xA0    <RL>+0x0020     # SPACE, right-left
0xA1    <RL>+0x0021     # EXCLAMATION MARK, right-left
0xA2    <RL>+0x0022     # QUOTATION MARK, right-left
0xA3    <RL>+0x0023     # NUMBER SIGN, right-left
0xA4    <RL>+0x0024     # DOLLAR SIGN, right-left
...

I'll attach a code snippet below to demonstrate (it can operate as a self-contained program), together with the output of "perl -V" on my system (in case that helps).

I understand that Mac developers would consider a conversion to unicode "lossy" or "non-reversible" if the directionality indicators are not preserved somehow (using RLE/LRE or RLO/LRO), and this might constitute an "algorithmic" approach that 'enc2xs' would not support.

Is there a work-around that will allow all the MacArabic code points to be converted successfully, given that their respective character semantics are all well established in unicode? Even a "lossy" conversion (ditching the directionality specs) would be better than the failures I'm getting now.
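To make the "lossy" idea concrete, here is a minimal sketch of a hand-rolled fallback that bypasses Encode altogether. It assumes each mapping line has the shape shown in the ARABIC.TXT excerpt quoted above (a byte value, an optional "<LR>+" or "<RL>+" tag, one Unicode value, then a comment). The names build_lossy_table, decode_lossy and @sample_lines are made up for illustration, and the sample holds only a few lines from the excerpt; a real run would read the whole file. Lines that map to more than one code point are skipped, which is a simplification.

```perl
use strict;
use warnings;

# A few mapping lines copied from the excerpt quoted in this post.
# (Hypothetical sample; a real run would read all of ARABIC.TXT.)
my @sample_lines = (
    '# comment lines and blank lines are ignored',
    '0x20    <LR>+0x0020     # SPACE, left-right',
    '0x21    <LR>+0x0021     # EXCLAMATION MARK, left-right',
    '0x81    <RL>+0x00A0     # NO-BREAK SPACE, right-left',
    '0x8C    <RL>+0x00AB     # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK, right-left',
    '0xA0    <RL>+0x0020     # SPACE, right-left',
    '0xA1    <RL>+0x0021     # EXCLAMATION MARK, right-left',
);

# Build a byte => code point table, discarding the direction tag.
sub build_lossy_table {
    my @lines = @_;
    my %table;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        # byte value, optional <LR>+ or <RL>+, then exactly one code point
        if ( $line =~ /^0x([0-9A-Fa-f]{2})\s+(?:<(?:LR|RL)>\+)?0x([0-9A-Fa-f]{4,6})(?=\s|$)/ ) {
            $table{ hex $1 } = hex $2;
        }
    }
    return \%table;
}

# Convert a string of octets to characters, one byte at a time.
# Bytes missing from the table become U+FFFD (REPLACEMENT CHARACTER).
sub decode_lossy {
    my ( $table, $octets ) = @_;
    return join '', map {
        my $cp = $table->{ ord $_ };
        defined $cp ? chr $cp : "\x{FFFD}";
    } split //, $octets;
}

my $table = build_lossy_table(@sample_lines);
my $text  = decode_lossy( $table, "\x20\xA1\x8C" );
printf "U+%04X\n", ord $_ for split //, $text;
# prints:
# U+0020
# U+0021
# U+00AB
```

This throws away the left-right / right-left distinction entirely, so 0x21 and 0xA1 both come out as U+0021. It is a stopgap rather than a fix for the MacArabic table that Encode ships.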
-----------
David Graff                   Linguistic Data Consortium
[EMAIL PROTECTED]             3600 Market St., Suite 810
voice: (215) 898-0887         University of Pennsylvania
fax:   (215) 573-2175         Philadelphia, PA 19104
http://www.ldc.upenn.edu
---------------

perl -V output:

Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=solaris, osvers=2.8, archname=sun4-solaris
    uname='sunos follicle.seas.upenn.edu 5.8 generic_108528-09 sun4u sparc sunw,sun-blade-1000 '
    config_args='-Dcc=gcc -Dprefix=/pkg/p/perl-5.8.0'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=y, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-fno-strict-aliasing -I/usr/local/include -I/pkg/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O',
    cppflags='-fno-strict-aliasing -I/usr/local/include -I/pkg/include'
    ccversion='', gccversion='2.95.2 19991024 (release)', gccosandvers='solaris2.7'
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags ='-L/usr/local/lib -R/usr/local/lib '
    libpth=/usr/local/lib /usr/lib /usr/ccs/lib /pkg/lib
    libs=-lsocket -lnsl -lgdbm -ldl -lm -lc
    perllibs=-lsocket -lnsl -ldl -lm -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-fPIC -L/pkg/lib -R/pkg/lib -I/pkg/include', lddlflags='-G -L/usr/local/lib -R/usr/local/lib -I/pkg/lib'

Characteristics of this binary (from libperl):
  Compile-time options: USE_LARGE_FILES
  Built under solaris
  Compiled at Sep 23 2002 15:26:38
  @INC:
    /pkg/p/perl-5.8.0/lib/5.8.0/sun4-solaris
    /pkg/p/perl-5.8.0/lib/5.8.0
    /pkg/p/perl-5.8.0/lib/site_perl/5.8.0/sun4-solaris
    /pkg/p/perl-5.8.0/lib/site_perl/5.8.0
    /pkg/p/perl-5.8.0/lib/site_perl
    .

use strict;
use Encode;

my ($octet_out, $utf8_out);
my @octet_in;
push @octet_in, chr($_) for (0x20 .. 0x7E, 0x80 .. 0xFF);

# Show that Encode functions are working for some vendor tables:
foreach my $table ( qw/cp1256 MacArabic/ ) {
    my @fail = ();
    my @succ = ();
    my @msgs = ();
    foreach ( @octet_in ) {
        my $char = $_;
        eval "\$utf8_out = decode( \'$table\', \$char, Encode::FB_CROAK )";
        if ( $@ ) {
            push @fail, $_;
            push @msgs, $@;
        }
        else {
            push @succ, $utf8_out;
        }
    }
    print join( ' ', "decoding via $table succeeds on:", (@succ) ? @succ : "nothing"), $/;
    print join( ' ', "decoding via $table fails on:", (@fail) ? @fail : "nothing"), $/;
    print STDERR @msgs;
}