range operator vs. unicode
Porters,

I found that ('a'..'z') works only for ASCII alphanumerics. Try the code below:

    use strict;
    use warnings;
    #use utf8;
    use charnames ':full';
    binmode STDOUT, ':utf8';
    # works
    print "$_\n" for ("\N{LATIN CAPITAL LETTER A}" .. "\N{LATIN CAPITAL LETTER Z}");
    # (0..9, 'A'..'Z', 'a'..'z'); symbols skipped
    print "$_\n" for ("\N{DIGIT ZERO}" .. "\N{LATIN SMALL LETTER Z}");
    # does not work
    print "$_\n" for ("\N{LATIN SMALL LETTER A}" .. "\N{LEFT CURLY BRACKET}");
    print "$_\n" for ("\N{NO-BREAK SPACE}" .. "\N{LATIN SMALL LETTER Y WITH DIAERESIS}");
    print "$_\n" for ("\N{GREEK CAPITAL LETTER ALPHA}" .. "\N{GREEK CAPITAL LETTER OMEGA}");
    print "$_\n" for ("\N{KATAKANA LETTER SMALL A}" .. "\N{KATAKANA LETTER VO}");
    __END__

There is an easy workaround, however.

    my @katakana = map { chr }
        (ord("\N{KATAKANA LETTER SMALL A}") .. ord("\N{KATAKANA LETTER VO}"));

Since we have the workaround above, I don't consider this range implementation a bug -- after all, we would be rather surprised if ("\x0" .. "\x{10}") worked. But the following documentation should be fixed so that Greeks are not confused by the result of ("\N{GREEK CAPITAL LETTER ALPHA}" .. "\N{GREEK CAPITAL LETTER OMEGA}"), Japanese are not confused by ("\N{KATAKANA LETTER SMALL A}" .. "\N{KATAKANA LETTER VO}"), and so forth.

perldoc perlop
    The range operator (in list context) makes use of the magical
    auto-increment algorithm if the operands are strings. You can say

        @alphabet = ('A' .. 'Z');

    to get all normal letters of the English alphabet, or

        $hexdigit = (0 .. 9, 'a' .. 'f')[$num & 15];

    to get a hexadecimal digit, or

        @z2 = ('01' .. '31'); print $z2[$mday];

    to get dates with leading zeros. If the final value specified is
    not in the sequence that the magical increment would produce, the
    sequence goes until the next value would be longer than the final
    value specified.

Dan the Man with Too Many Characters to Squeeze in the Range
Re: range operator vs. unicode
On Jun 08, 2006, at 17:34, Yitzchak Scott-Thoennes wrote:
> Which part should be fixed?

The limitation of the magic, namely:

> The key part is that magical auto-increment is defined earlier as only working for strings matching /^[a-zA-Z]*[0-9]*\z/.

That is described under "Auto-increment and Auto-decrement", though "Range Operators" does mention it:

perldoc perlop
    The range operator (in list context) makes use of the magical
    auto-increment algorithm if the operands are strings.

This would make lawyers happy enough, but not (Uni)?coders like myself. With the advent of Unicode support, more people will attempt things like ("\N{alpha}" .. "\N{omega}") and wonder why it does not work like ('a'..'z'). So we should add something like:

    =head2 CAVEAT

    Note that the range operator cannot apply magic beyond
    C<[a-zA-Z0-9]>. Therefore

      use charnames 'greek';
      my @greek_small = ("\N{alpha}" .. "\N{omega}");

    does not work. If you want non-ASCII ranges, try

      my @greek_small = map { chr } ( ord("\N{alpha}") .. ord("\N{omega}") );

    On the other hand, ranges in regexps and C<tr///> do work. You may
    consider this inconsistent, but the range operator must accept
    variables, as in C<($start .. $end)>, while character ranges in a
    regexp are constant.

    =cut

Dan the Range (?:Ar)ranger
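The ord/chr workaround from this thread can be wrapped in a small helper. This is only a sketch: the name char_range is made up for illustration and is not part of perl or any module. It returns every code point between the two characters, so a Greek range also includes FINAL SIGMA (U+03C2).

```perl
use strict;
use warnings;

# Hypothetical helper (not a core function): list every character from
# $first to $last by code point, bypassing the magical string increment.
sub char_range {
    my ($first, $last) = @_;
    return map { chr } ord($first) .. ord($last);
}

# GREEK SMALL LETTER ALPHA (U+03B1) .. GREEK SMALL LETTER OMEGA (U+03C9)
my @greek_small = char_range("\x{3b1}", "\x{3c9}");
binmode STDOUT, ':utf8';
print join(' ', @greek_small), "\n";
```

Because the range is purely numeric, it does not skip unassigned or non-letter code points that happen to fall inside the interval; filter the list if that matters.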
[Encode] 2.16 released!
Porters,

I just released Encode 2.16 as follows. In terms of code it is virtually no different from 2.15, but it contains two important non-code fixes.

First, it addresses the absence of a COPYRIGHT section. Since Encode is part of core and I felt I owed my fellows too much credit to claim such, I kept leaving that part intentionally blank -- till I got the ticket quoted below.

Second, I perltidy-ed all *.pm's. Encode has accepted so many patches, and in the course of doing so it kinda turned into a grab-bag of coding styles. I reckoned it was high time I applied good practices. The only difference from the perltidy default is -l=76; I did so because that is what MIME headers use.

> Ticket URL: http://rt.cpan.org/Ticket/Display.html?id=19056
> There is no license in the Encode package. It is not clear under what basis CPAN has permission to distribute the module.

So I finally put my name on COPYRIGHT while adding this disclaimer to the MAINTAINER section.

    While Dan Kogai retains the copyright as a maintainer, the credit
    should go to all those involved. See AUTHORS for those who
    submitted code.

If any of you are listed in the AUTHORS section and want your name added to COPYRIGHT, you are welcome.

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.16.tar.gz and CPAN near you.

=head1 Changes

$Revision: 2.16 $ $Date: 2006/05/03 18:24:10 $
! bin/piconv
  --xmlcref and --htmlcref added.
! Encode.pm
  Copyright Notice Added.
  http://rt.cpan.org/NoAuth/Bug.html?id=#19056
! *
  Replaced remaining ^\t with q( ) x 4. -- Perl Best Practice pp. 20
  And all .pm's are now perltidy-ed.

=for Maintperl

Encode remains 2.12 there, but I consider the current version mature enough for maint. Nicholas, would you consider doing so?

Dan the Encode Maintainer
Re: \p{IsBogus} vs. exception
On Mar 07, 2006, at 01:45, Yitzchak Scott-Thoennes wrote:
> So the property is only checked for validity at the point when it is actually used. I'm not sure it would even be desirable to check it before then (that is, at regcomp-time), remembering that Perl is a dynamic language.

Maybe too dynamic :) Not many people would expect \p{IsBogus} to be completely ignored in

    'str' =~ / \p{IsBogus}/;

But in cases like

    $str = $ARGV[0];
    $str =~ / \p{IsBogus}/;

the code may or may not raise an exception, and that's somewhat tricky.

On Mar 07, 2006, at 01:45, [EMAIL PROTECTED] wrote:
> This all looks perfectly consistent to me: the expensive work of looking up the property is not done until matching actually gets to that point.

Thanks. Sounds reasonable to me, too.

> I would not call this a bug (not sure if you were suggesting that it is) - if you need to check whether a property is bogus, your example has_unicode_property is fine. But it would not be unreasonable for utf8.pm (or something) to provide a function that delivers the same information.

I agree.

Dan the Perl5 Porter
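The thread mentions a has_unicode_property helper without showing it. One way such a check could look is sketched below. It is an assumption, not the code from the original thread: it forces the regex engine to actually reach the property by putting it first in the pattern and matching against a non-empty string, then traps the resulting exception.

```perl
use strict;
use warnings;

# Sketch of a property check (the body is an assumption; only the name
# comes from the thread). The property is the first thing the engine has
# to test, so an unknown name dies inside the eval, either when the
# pattern is compiled or when the match reaches it.
sub has_unicode_property {
    my ($name) = @_;
    return eval { "a" =~ /\p{$name}/; 1 } ? 1 : 0;
}

print "Alphabetic: ", has_unicode_property("Alphabetic"), "\n";
print "IsBogus:    ", has_unicode_property("IsBogus"), "\n";
```

The name is interpolated into a pattern, so this should only be fed trusted property names.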
Re: [Encode::Guess] ambiguous result !?
Christophe,

Thanks for your mail. So far Encode::Guess does not guess the encoding for a filehandle, not because it is impossible but because Encode::Guess takes a very conservative -- even paranoid -- approach. For example, ambiguity raises an exception instead of falling back to a preferred encoding.

One way to enable guessing on a filehandle goes like this.

    sub open_and_guess {
        my $filename = shift;
        open my $fh, "<:raw", $filename
            or return;                  # or die if you like
        my $head = <$fh>;               # may not work for UTF-(16|32)
        my $enc = guess_encoding($head);
        ref $enc or die $enc;           # or return $enc
        my $encname = $enc->name;
        seek $fh, 0, 0;
        binmode $fh, ":encoding($encname)";
        return $fh;
    }

Here we open, guess, and reopen, but this does not work in the general case; not all files are seekable and reopenable (e.g. pipes and sockets).

To know more about Encode::Guess, try
http://search.cpan.org/~dankogai/Encode-2.12/lib/Encode/Guess.pm

Yours,
Dan the Encode Maintainer

On Oct 24, 2005, at 19:12, HERMIER Christophe wrote:
> Hello,
>
> I am using the Encode::Guess module to detect the encoding of a file before opening it. Basically I believe that I can have various sorts of unicode encodings or latin-1. What I want to get is an encoding string to give back to open. My code goes like this:
>
>     my $codage = guess_encoding ( $debut );
>     if ( UNIVERSAL::isa ( $codage, "Encode::utf8" ) ) {
>         $codage = ":utf8";
>     } elsif ( UNIVERSAL::isa ( $codage, "Encode::Unicode" ) ) {
>         $codage = ":encoding(utf-16)";
>     } else {
>         $codage = ":encoding(iso-8859-1)";
>     }
>
> The problem is with the Encode::Unicode case: I don't know if it is UTF16-LE or UTF16-BE, and it could even be UTF32. Is there a way to know that?
>
> BTW, I checked your homepage (http://www.dan.co.jp/) first but it does not seem to work?
>
> Regards,
> Christophe.
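For the question of telling UTF-16LE from UTF-16BE or UTF-32, one option that does not depend on Encode::Guess at all is to look at the byte order mark in the first bytes read with :raw. The helper below is a sketch with a made-up name (bom_encoding); it only recognizes files that actually start with a BOM and returns nothing otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper: map a leading byte order mark to an encoding name.
# The 4-byte UTF-32 marks are tested first because the UTF-32LE mark
# begins with the same two bytes as the UTF-16LE mark.
sub bom_encoding {
    my ($head) = @_;
    return 'UTF-32BE' if $head =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $head =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $head =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $head =~ /\A\xFF\xFE/;
    return 'UTF-8'    if $head =~ /\A\xEF\xBB\xBF/;
    return;    # no BOM: fall back to guessing or a default
}

my $name = bom_encoding("\xFF\xFEh\x00i\x00");
print defined $name ? $name : 'unknown', "\n";
```

One known ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by this test.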
Re: UTF-16LE fails in substitution
On Sep 15, 2005, at 07:05, Steve Larson wrote:
> What I want to do is add a version string comment at the beginning of .xml files. I test to see if the file is UNICODE (Encode::Unicode) or ASCII (Encode::XS) using guess_encoding. My ASCII case works fine but the regexp for the UNICODE case fails. Below snippet is the code for the UNICODE case.

The answer is that PerlIO does not go well with BOMed UTFs. What you should do instead is read the whole file first, like this:

    open my $in, "<:raw", $filename or die "$filename : $!";
    read $in, my $buf, -s $filename;    # one of many ways to slurp file.
    close $in;
    my $content = decode("UTF16", $buf);    # LE or BE is not required.
    #
    # do whatever you want to $content and
    #
    open my $out, ">:raw", $filename or die "$filename : $!";
    print $out encode("UTF16-LE", $content);    # now be explicit on endianness
    close $out;

Remember that UTF-(16|32) does not go well with stream models. Treat it as a binary file.

Dan the Encode Maintainer
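One detail worth keeping in mind with the slurp-and-rewrite approach: decoding as "UTF-16" consumes the BOM, while encoding as "UTF-16LE" does not write one. If the consumer of the file expects a BOM, it has to be added by hand. The in-memory sketch below illustrates this; the function name add_version_comment and the comment text are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sketch: take BOMed UTF-16 octets, prepend a comment to the text, and
# return UTF-16LE octets with an explicit little-endian BOM.
sub add_version_comment {
    my ($octets, $comment) = @_;
    my $content = decode("UTF-16", $octets);    # BOM decides LE/BE and is dropped
    $content = $comment . $content;
    return "\xFF\xFE" . encode("UTF-16LE", $content);    # UTF-16LE adds no BOM
}

my $in  = "\xFF\xFE" . encode("UTF-16LE", "<a/>");
my $out = add_version_comment($in, "<!-- v1 -->");
print length($out), " octets\n";
```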
[Encode] 2.12 Released!
Porters,

I am pleased to release Encode Version 2.12 as follows:

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.12.tar.gz and CPAN near you.

=head1 Highlight

You can finally use a coderef for CHECK.

    coderef for CHECK
        As of Encode 2.12 CHECK can also be a code reference which
        takes the ord value of the unmapped character as an argument
        and returns a string that represents the fallback character.
        For instance,

          $ascii = encode("ascii", $utf8, sub{ sprintf "<U+%04X>", shift });

        acts like FB_PERLQQ but <U+XXXX> is used instead of \x{XXXX}.

=head1 Changes

$Revision: 2.12 $ $Date: 2005/09/08 14:17:17 $
! Encode.xs Encode.pm t/fallback.t
  Now accepts coderef for CHECK!
! ucm/8859-7.ucm
  Updated to newer version at unicode.org
  http://rt.cpan.org/NoAuth/Bug.html?id=14222
! lib/Encode/Supported.pod
  More POD typo fixed.
  [EMAIL PROTECTED]
! encoding.pm
  More POD typo leftover fixed.
  Message-Id: [EMAIL PROTECTED]

=head1 Signature

Dan the Encode Maintainer
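A small runnable version of the coderef fallback announced above, assuming Encode 2.12 or later is installed. The wrapper name to_ascii_marked is made up for the example; the encode() call itself follows the release note.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode to ASCII, replacing each unmappable character with <U+XXXX>.
# The coderef receives the code point (ord value) of the character.
sub to_ascii_marked {
    my ($text) = @_;
    return encode("ascii", $text, sub { sprintf "<U+%04X>", shift });
}

print to_ascii_marked("caf\x{e9} \x{263a}"), "\n";
```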
Re: intelligent lexically encoding
On Sep 08, 2005, at 11:22, Jerzy Giergiel wrote:
> sorry for bugging people here with a trivial question. I need to convert from MacRoman encoding to ascii (7-bit). Encode package simply replaces out of range characters with a question mark. I need something intelligent lexically speaking. For example aacute should be converted to a. Any suggestions?

Maybe you need to implement your own fallback method. FYI, Encode already has fallback methods as follows.

    $ascii = encode("ascii", $utf8, $fallback);

where

    $fallback is            á (U+00E1) will be
    ------------------------------------------
    Encode::FB_PERLQQ       \x{00E1}
    Encode::FB_HTMLCREF     &#225;
    Encode::FB_XMLCREF      &#xe1;

If any of those will suffice, go ahead and use it. If not, you have to go like this:

    $ascii = $utf8;
    $ascii =~ s/([^\x00-\x7f])/your_own_fallback($1)/eg;

Hope that helps.

Dan the Encode Maintainer
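One possible shape for "your own fallback" in the aacute-to-a case is to decompose the text and drop the combining marks before encoding. This is a different technique from the substitution shown in the reply, offered only as a sketch; ascii_fold is a made-up name, and it relies on Unicode::Normalize, which ships with perl. Characters that have no base-letter decomposition still end up as "?".

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD);

# Sketch: split accented letters into base letter + combining mark (NFD),
# remove the marks, then encode what is left as ASCII.
sub ascii_fold {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}+//g;           # strip nonspacing marks
    return encode("ascii", $decomposed);   # anything left over becomes "?"
}

# MacRoman octet 0x87 is LATIN SMALL LETTER A WITH ACUTE
print ascii_fold(decode("MacRoman", "\x87bc")), "\n";
```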
Re: IRI support in URI and URI::Escape modules
On Jan 31, 2005, at 18:19, Martin Duerst wrote:
> I started with some very simple (I thought) tests, but got completely confused very quickly. Here is the short program that I was using:
>
> test.pl
>     use utf8;
>     use URI;
>     use URI::Escape;
>     print (uri_escape("\xFD")
>     [snip]
>
> With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread (with 1 registered patch, see perl -V for more detail), I get
>
>     %FD
>     %C3%BD
>     [snip]
>
> However, on perl, v5.8.4 built for i386-linux-thread-multi, I get:
>
>     %FD
>     [snip]
>
> Nothing seems to work anymore, although (or because?) 5.8 has better Unicode support.

The (easiest|new canonical) way to go is to use uri_escape_utf8() instead of uri_escape(). Note that as of version 3.28 uri_escape_utf8() is NOT AUTOMATICALLY imported.

    % perl -MURI::Escape -le 'print uri_escape("\xFD")'
    %FD
    % perl -MURI::Escape=uri_escape_utf8 -le 'print uri_escape_utf8("\xFD")'
    %C3%BD

perldoc URI::Escape
    uri_escape_utf8( $string )
    uri_escape_utf8( $string, $unsafe )
        Works like uri_escape(), but will encode chars as UTF-8 before
        escaping them. This makes this function able to deal with
        characters with code above 255 in $string. Note that chars in
        the 128 .. 255 range will be escaped differently by this
        function compared to what uri_escape() would. For chars in the
        0 .. 127 range there is no difference.

        The call:

            $uri = uri_escape_utf8($string);

        will be the same as:

            use Encode qw(encode);
            $uri = uri_escape(encode("UTF-8", $string));

        but will even work for perl-5.6 for chars in the 128 .. 255
        range.

Dan the Encode Maintainer
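To make the difference between the two outputs concrete, here is a core-only approximation of what "encode as UTF-8, then percent-escape" means. It is not URI::Escape's implementation: the function name is made up, and the set of characters left unescaped (letters, digits, "-", ".", "_", "~") is an assumption chosen for the sketch.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sketch only: UTF-8 encode first, then percent-escape every octet that
# is not in the assumed "safe" set. With real code, prefer
# URI::Escape::uri_escape_utf8().
sub escape_utf8_sketch {
    my ($text) = @_;
    my $octets = encode("UTF-8", $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

print escape_utf8_sketch("\xFD"), "\n";
```

Escaping the raw character without the UTF-8 step is what yields %FD for the same input.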
real UTF-8 vs. utf8n_to_uvuni()
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
> Thanks, applied in my repository. New tests and documentation fix in progress. When I am done w/ that, I will release Encode-2.0901 on my web (not CPAN yet). When cross-checks by porters are done I will release Encode-2.10.
>
> Dan the Encode Maintainer

Now I am writing test suites and found that some of the strictures are missing.

Surrogate -- OK

    % perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
    "\x{d801}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK

    % perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
    "\x{ffff}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK

    % perl -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'

Since Gisle's patch makes use of utf8n_to_uvuni(), it seems to be a problem of perl core. So I have checked utf8.c, which defines that. It seems it does not make use of PERL_UNICODE_MAX. The patch against utf8.c below fixes that.

    % ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{110000}"; print encode("UTF-8", $a, 1)'
    "\x{00f4}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you see, the warning is still funny. But any case w/ UTF8_WARN_LONG is funny, as follows:

    % perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
    ??
    % perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
    "\x{00fe}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked it down and found this warning is handled by Encode, so Gisle and I can fix that.

Dan the Encode Maintainer

--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c       Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
        }
        else
            uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
        if (!(uv > ouv)) {
            /* These cannot be allowed. */
            if (uv == ouv) {
Re: Make Encode.pm support the real UTF-8
On Dec 04, 2004, at 11:51, Larry Wall wrote:
> On Fri, Dec 03, 2004 at 10:12:12PM +0000, Tim Bunce wrote:
> : I've no problem with 'utf8' being perl's unrestricted utf8 encoding,
> : but UTF-8 is the name of the standard and should give the
> : corresponding behaviour.
>
> For what it's worth, that's how I've always kept them straight in my head. Also for what it's worth, Perl 6 will mostly default to strict but make it easy to switch back to lax.
>
> Larry

Okay, looks like the verdict is reached.

1. utf8 will stay liberal
2. UTF-8 will be strict

The rest is mostly implementation.

2.1. What will the canonical name of the strict version of UTF-8 be? Gisle already submitted me a test patch and it uses 'utf-8-strict'. If there is no objection, I would like to use that.

2.2. CAVEAT: UTF8 will be utf8, not utf-8-strict, since Encode aliasing is case insensitive.

2.3. Degree of stricture. How strict are we going to make utf-8-strict?
     a. simply make use of UTF8_ALLOW_* in utf8.h?
     b. unmapped codepoints banned as well?
     IMHO a. is strict enough, since mapped codepoints are subject to increase as the Unicode Standard updates.

2.4. We can always make UTF-8 liberal by reapplying an alias.

Anything else missing?

Dan the Encode Maintainer
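With an Encode release that implements this verdict (2.10 or later is assumed here), the naming and the difference in strictness can be observed directly. The two helper names below are made up for the sketch; find_encoding, name, and FB_CROAK are regular Encode API.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Canonical name an alias resolves to, or undef if unknown.
sub canonical_name {
    my ($alias) = @_;
    my $enc = find_encoding($alias);
    return $enc ? $enc->name : undef;
}

# True if the text can be encoded by the strict UTF-8 encoder.
sub strict_encode_ok {
    my ($text) = @_;
    my $copy = $text;    # a CHECK argument may modify the source string
    return eval { encode("UTF-8", $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

print "$_ => ", canonical_name($_), "\n" for qw(UTF-8 utf8 UTF8);
print "surrogate accepted by UTF-8? ", strict_encode_ok(chr(0xD801)), "\n";
```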
Re: clearing the utf8 flag
On Nov 10, 2004, at 01:30, Paul Bijnens wrote:
> I have a program that reads and writes (among others) strings that should be utf8 encoded. I say "should", because somewhere deep inside the dark corners of that program, sometimes, the utf8 flag on a string is lost. (I'm still investigating where, tips to attack such a problem are welcome.)

Even when you try to set the UTF-8 flag on a string which consists entirely of ASCII ( /^[\x00-\x7f]*$/ ), the UTF-8 flag will not be on. See "The UTF-8 flag" section of 'perldoc Encode'. Here is the short summary.

perldoc Encode
    o When you decode, the resulting utf8 flag is on unless you can
      unambiguously represent data. Here is the definition of
      dis-ambiguity.

      After $utf8 = decode('foo', $octet);,

        When $octet is...                 The utf8 flag in $utf8 is
        ---------------------------------------------------------
        In ASCII only (or EBCDIC only)    OFF
        In ISO-8859-1                     ON
        In any other Encoding             ON
        ---------------------------------------------------------

      As you see, there is one exception, In ASCII. That way you can
      assume Goal #1. And with Encode Goal #2 is assumed but you still
      have to be careful in such cases mentioned in CAVEAT paragraphs.

> When writing the string, the program clears the utf8 flag and writes a simple string of octets using:
>
>     $s = encode("utf8", $s) if $s =~ /[^\x00-\x7f]/;
>     $n = length($s);    # yes, we need length in bytes
>     ...
>     print $s;

If what you need is the byte length, you can simply use bytes as follows. binmode is for print().

    use bytes ();    # avoid imports
    binmode STDOUT => ":utf8";
    my $s = "\x{5c0f}\x{98fc} \x{5f3e}";
    # ...
    my $n = length($s);
    my $l = bytes::length($s);
    # ...
    print $s;

> Why would someone test for pure 7-bit strings instead of:
>
>     $s = encode("utf8", $s) if Encode::is_utf8($s);

For most cases you don't have to, and you should not have to (unless you maintain Encode and/or perl :). Complex as it may be, the internal UTF-8 flag was the best way to harness UTF-8 while keeping legacy, byte-oriented scripts compatible.

> which seems superior to avoid double utf8 encodings, should the utf8-flag be lost. And it's faster. Or even simply:
>
>     Encode::_utf8_off($s)
>
> The problem is that I'm usually wrong. Am I this time? Am I missing something? Or do I need more coffee?

I have to admit that Encode and the Perl 5.8 way of handling Unicode need more recipes (Perl Cookbook 2nd Ed. does cover that issue in Ch. 8, but it was hardly enough).

Dan the Encode Maintainer
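For the "length in bytes" part of this exchange, an alternative to bytes::length is to measure the encoded octets explicitly, which does not depend on the internal representation of the string. The helper name below is made up for the sketch.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return (length in characters, length in UTF-8 octets) for a text string.
sub char_and_byte_length {
    my ($text) = @_;
    return (length($text), length(encode("UTF-8", $text)));
}

# Same sample string as in the message: three CJK characters and a space.
my ($chars, $octets) = char_and_byte_length("\x{5c0f}\x{98fc} \x{5f3e}");
print "$chars characters, $octets octets\n";
```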
Re: Help with uc and lc and utf8
On Nov 06, 2004, at 15:21, Robert D Oden wrote:
> I am not able to lower case a . I am sure I am missing something simple but I have spent many hours researching and trying different things to no avail. Any help would be appreciated!!

Make sure:

* You have saved your script in UTF-8, not Latin1
* You "use utf8" so that string literals are treated as UTF-8 strings
* If you print, you set the filehandle layer to :utf8.

Try the script below (be sure to save it in UTF-8). I got KThe => kthe.

    #
    use strict;
    use utf8;
    my $fname = 'KThe';
    my $lc_fname = lc($fname);
    binmode STDOUT => ":utf8";
    print "$fname => $lc_fname\n";
    __END__

> Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:

You should upgrade to 5.8.1 or above (5.8.5 being the latest). 5.8.0 was still premature Unicode-wise.

Dan the Encode Maintainer
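When the text arrives as UTF-8 octets from a file or a form rather than as a literal in the script, the same advice applies in a slightly different form: decode first, change case on the character string, and encode again on the way out. A sketch, with a made-up function name:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Lower-case UTF-8 encoded octets: lc() needs characters, not octets.
sub lc_utf8_octets {
    my ($octets) = @_;
    my $text = decode("UTF-8", $octets);
    return encode("UTF-8", lc $text);
}

# "\xC3\x89" is LATIN CAPITAL LETTER E WITH ACUTE in UTF-8
print lc_utf8_octets("\xC3\x89COLE"), "\n";
```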
Re: Encode-2.07 vs. PerlIO::encoding
On Oct 24, 2004, at 18:34, Rafael Garcia-Suarez wrote:
> Welcome to backward compatibility hell :)

Hell it was, but it seems I came up with a way out (yay).

> > I just want Encode::utf8->decode() to make sure Encode::RETURN_ON_ERR is on when the caller is PerlIO::encoding...
>
> Or, one could backport PerlIO::encoding (with your patch) to CPAN and require this latest version for Encode 2.08.

That was what came across my mind first, but I found it was not good enough to coerce Encode::RETURN_ON_ERR, since $PerlIO::encoding::fallback is open to the public (even documented!).

So far ->renew() is only used by PerlIO (and is meaningful only when the object is Encode::Unicode). In other words, you can tell it's PerlIO that is calling you if the object is renewed. The following patch does that. The new Encode::utf8->decode() checks $self->renewed and, if so, sets Encode::RETURN_ON_ERR.

Here is the patch, or you can wait for Encode-2.08. Thankfully Encode::XS needs no real ->renew so it is left as is (a dummy ->renewed() was introduced just to be safe).
Dan the Encode Maintainer

diff -ruN ext/Encode-2.07/Encode.xs ext/Encode/Encode.xs
--- ext/Encode-2.07/Encode.xs   Sat Oct 23 04:37:13 2004
+++ ext/Encode/Encode.xs        Sun Oct 24 20:31:06 2004
@@ -252,14 +252,6 @@
 PROTOTYPES: DISABLE
 
 void
-Method_renew(obj)
-SV *   obj
-CODE:
-{
-    XSRETURN(1);
-}
-
-void
 Method_decode_xs(obj,src,check = 0)
 SV *   obj
 SV *   src
@@ -270,6 +262,28 @@
     U8 *s = (U8 *) SvPV(src, slen);
     U8 *e = (U8 *) SvEND(src);
     SV *dst = newSV(slen>0?slen:1); /* newSV() abhors 0 -- inaba */
+
+    /*
+     * PerlO check -- we assume the object is of PerlIO if renewed
+     * and if so, we set RETURN_ON_ERR for partial character
+     */
+    int renewed = 0;
+    dSP; ENTER; SAVETMPS;
+    PUSHMARK(sp);
+    XPUSHs(obj);
+    PUTBACK;
+    if (call_method("renewed",G_SCALAR) == 1) {
+        SPAGAIN;
+        renewed = POPi;
+        PUTBACK;
+#if 0
+        fprintf(stderr, "renewed == %d\n", renewed);
+#endif
+        if (renewed){ check |= ENCODE_RETURN_ON_ERR; }
+    }
+    FREETMPS; LEAVE;
+    /* end PerlIO check */
+
     SvPOK_only(dst);
     SvCUR_set(dst,0);
     if (SvUTF8(src)) {
@@ -397,6 +411,14 @@
 {
     XSRETURN(1);
 }
+
+int
+Method_renewed(obj)
+SV *   obj
+CODE:
+    RETVAL = 0;
+OUTPUT:
+    RETVAL
 
 void
 Method_name(obj)
diff -ruN ext/Encode-2.07/Unicode/Unicode.pm ext/Encode/Unicode/Unicode.pm
--- ext/Encode-2.07/Unicode/Unicode.pm  Sat Oct 23 04:37:17 2004
+++ ext/Encode/Unicode/Unicode.pm       Sun Oct 24 20:38:16 2004
@@ -46,7 +46,7 @@
     my $self = shift;
     $BOM_Unknown{$self->name} or return $self;
     my $clone = bless { %$self } => ref($self);
-    $clone->{clone} = 1; # so the caller knows it is renewed.
+    $clone->{clone}++ # so the caller knows it is renewed.
     return $clone;
 }
 
diff -ruN ext/Encode-2.07/lib/Encode/Encoding.pm ext/Encode/lib/Encode/Encoding.pm
--- ext/Encode-2.07/lib/Encode/Encoding.pm      Sat Oct 23 04:37:13 2004
+++ ext/Encode/lib/Encode/Encoding.pm   Sun Oct 24 20:25:13 2004
@@ -5,6 +5,7 @@
 
 require Encode;
 
+sub DEBUG { 0 }
 sub Define
 {
     my $obj = shift;
@@ -16,7 +17,18 @@
 
 sub name  { return shift->{'Name'} }
 
-sub renew { return $_[0] }
+# sub renew { return $_[0] }
+
+sub renew {
+    my $self = shift;
+    my $clone = bless { %$self } => ref($self);
+    $clone->{renewed}++; # so the caller can see it
+    DEBUG and warn $clone->{renewed};
+    return $clone;
+}
+
+sub renewed{ return $_[0]->{renewed} || 0 }
+
 *new_sequence = \&renew;
 
 sub needs_lines { 0 };
@@ -167,24 +179,28 @@
 
 Predefined As:
 
-  sub renew { return $_[0] }
+  sub renew {
+    my $self = shift;
+    my $clone = bless { %$self } => ref($self);
+    $clone->{renewed}++;
+    return $clone;
+  }
 
 This method reconstructs the encoding object if necessary.  If you need
 to store the state during encoding, this is where you clone your object.
-Here is an example:
-
-  sub renew {
-      my $self = shift;
-      my $clone = bless { %$self } => ref($self);
-      $clone->{clone} = 1; # so the caller can see it
-      return $clone;
-  }
-
-Since most encodings are stateless the default behavior is just return
-itself as shown above.
 
 PerlIO ALWAYS calls this method to make sure it has its own private
 encoding object.
+
+=item -E<gt>renewed
+
+Predefined As:
+
+  sub renewed { $_[0]->{renewed} || 0 }
+
+Tells whether the object is renewed (and how many times).  Some
+modules emit C<Use of uninitialized value in null operation> warning
+unless the value is numeric so return 0 for false.
 
 =item -E<gt>perlio_ok()
 
Re: Encode-2.07 vs. PerlIO::encoding
On Oct 24, 2004, at 20:50, Dan Kogai wrote:
> The following patch does that. The new Encode::utf8->decode() checks $self->renewed and if so it sets Encode::RETURN_ON_ERR. Here is the patch or you can wait for Encode-2.08.

One patch to Unicode/Unicode.xs was missing and Unicode/Unicode.pm was garbled. Here we go again, the patch against 2.07. Forget the previous patch. Or wait for Encode-2.08.

Dan the Encode Maintainer

diff -ruN ext/Encode-2.07/Encode.xs ext/Encode/Encode.xs
--- ext/Encode-2.07/Encode.xs   Sat Oct 23 04:37:13 2004
+++ ext/Encode/Encode.xs        Sun Oct 24 20:31:06 2004
@@ -252,14 +252,6 @@
 PROTOTYPES: DISABLE
 
 void
-Method_renew(obj)
-SV *   obj
-CODE:
-{
-    XSRETURN(1);
-}
-
-void
 Method_decode_xs(obj,src,check = 0)
 SV *   obj
 SV *   src
@@ -270,6 +262,28 @@
     U8 *s = (U8 *) SvPV(src, slen);
     U8 *e = (U8 *) SvEND(src);
     SV *dst = newSV(slen>0?slen:1); /* newSV() abhors 0 -- inaba */
+
+    /*
+     * PerlO check -- we assume the object is of PerlIO if renewed
+     * and if so, we set RETURN_ON_ERR for partial character
+     */
+    int renewed = 0;
+    dSP; ENTER; SAVETMPS;
+    PUSHMARK(sp);
+    XPUSHs(obj);
+    PUTBACK;
+    if (call_method("renewed",G_SCALAR) == 1) {
+        SPAGAIN;
+        renewed = POPi;
+        PUTBACK;
+#if 0
+        fprintf(stderr, "renewed == %d\n", renewed);
+#endif
+        if (renewed){ check |= ENCODE_RETURN_ON_ERR; }
+    }
+    FREETMPS; LEAVE;
+    /* end PerlIO check */
+
     SvPOK_only(dst);
     SvCUR_set(dst,0);
     if (SvUTF8(src)) {
@@ -397,6 +411,14 @@
 {
     XSRETURN(1);
 }
+
+int
+Method_renewed(obj)
+SV *   obj
+CODE:
+    RETVAL = 0;
+OUTPUT:
+    RETVAL
 
 void
 Method_name(obj)
diff -ruN ext/Encode-2.07/Unicode/Unicode.pm ext/Encode/Unicode/Unicode.pm
--- ext/Encode-2.07/Unicode/Unicode.pm  Sat Oct 23 04:37:17 2004
+++ ext/Encode/Unicode/Unicode.pm       Sun Oct 24 21:20:22 2004
@@ -46,7 +46,7 @@
     my $self = shift;
     $BOM_Unknown{$self->name} or return $self;
     my $clone = bless { %$self } => ref($self);
-    $clone->{clone} = 1; # so the caller knows it is renewed.
+    $clone->{renewed}++; # so the caller knows it is renewed.
     return $clone;
 }
 
diff -ruN ext/Encode-2.07/Unicode/Unicode.xs ext/Encode/Unicode/Unicode.xs
--- ext/Encode-2.07/Unicode/Unicode.xs  Sat Oct 23 04:37:21 2004
+++ ext/Encode/Unicode/Unicode.xs       Sun Oct 24 21:20:22 2004
@@ -1,5 +1,5 @@
 /*
- $Id: Unicode.xs,v 2.0 2004/05/16 20:55:16 dankogai Exp $
+ $Id: Unicode.xs,v 2.0 2004/05/16 20:55:16 dankogai Exp dankogai $
  */
 
 #define PERL_NO_GET_CONTEXT
@@ -97,7 +97,7 @@
     U8 endian   = *((U8 *)SvPV_nolen(attr("endian", 6)));
     int size    =   SvIV(attr("size",   4));
     int ucs2    = SvTRUE(attr("ucs2",   4));
-    int clone   = SvTRUE(attr("clone",  5));
+    int renewed = SvTRUE(attr("renewed",  7));
     SV *result = newSVpvn("",0);
     STRLEN ulen;
     U8 *s = (U8 *)SvPVbyte(str,ulen);
@@ -124,7 +124,7 @@
     }
 #if 1
     /* Update endian for next sequence */
-    if (clone) {
+    if (renewed) {
         hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
     }
 #endif
@@ -200,7 +200,7 @@
     U8 endian   = *((U8 *)SvPV_nolen(attr("endian", 6)));
     int size    =   SvIV(attr("size",   4));
     int ucs2    = SvTRUE(attr("ucs2",   4));
-    int clone   = SvTRUE(attr("clone",  5));
+    int renewed = SvTRUE(attr("renewed",  7));
     SV *result = newSVpvn("",0);
     STRLEN ulen;
     U8 *s = (U8 *)SvPVutf8(utf8,ulen);
@@ -211,7 +211,7 @@
     enc_pack(aTHX_ result,size,endian,BOM_BE);
 #if 1
     /* Update endian for next sequence */
-    if (clone){
+    if (renewed){
         hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
     }
 #endif
diff -ruN ext/Encode-2.07/lib/Encode/Encoding.pm ext/Encode/lib/Encode/Encoding.pm
--- ext/Encode-2.07/lib/Encode/Encoding.pm      Sat Oct 23 04:37:13 2004
+++ ext/Encode/lib/Encode/Encoding.pm   Sun Oct 24 20:25:13 2004
@@ -5,6 +5,7 @@
 
 require Encode;
 
+sub DEBUG { 0 }
 sub Define
 {
     my $obj = shift;
@@ -16,7 +17,18 @@
 
 sub name  { return shift->{'Name'} }
 
-sub renew { return $_[0] }
+# sub renew { return $_[0] }
+
+sub renew {
+    my $self = shift;
+    my $clone = bless { %$self } => ref($self);
+    $clone->{renewed}++; # so the caller can see it
+    DEBUG and warn $clone->{renewed};
+    return $clone;
+}
+
+sub renewed{ return $_[0]->{renewed} || 0 }
+
 *new_sequence = \&renew;
 
 sub needs_lines { 0 };
@@ -167,24 +179,28 @@
 
 Predefined As:
 
-  sub renew { return $_[0] }
+  sub renew {
+    my $self = shift;
+    my $clone = bless { %$self } => ref($self);
+    $clone->{renewed}++;
+    return $clone;
+  }
 
 This method reconstructs the encoding object if necessary.  If you need
 to store the state during encoding, this is where you clone your object.
-Here is an example:
-
-  sub renew {
-      my $self = shift;
-      my $clone = bless { %$self } => ref($self);
-      $clone->{clone} = 1; # so the caller can see it
-      return
[Encode] 2.08 released
Porters,

On Oct 24, 2004, at 20:50, Dan Kogai wrote:
> > The following patch does that. The new Encode::utf8->decode() checks $self->renewed and if so it sets Encode::RETURN_ON_ERR. Here is the patch or you can wait for Encode-2.08.
>
> One patch to Unicode/Unicode.xs was missing and Unicode/Unicode.pm was garbled. Here we go again, the patch against 2.07. Forget the previous patch. Or wait for Encode-2.08

And here comes Encode-2.08. If you are by any chance using Encode-2.07, upgrade RIGHT NOW!

=head1 Tested

As follows:

    Perl 5.8.3 on Mac OS X v10.3.5 (/usr/bin/perl, post-built as in CPAN)
    Perl 5.8.5 on Mac OS X v10.3.5 (post-built)
               on FreeBSD 4.10-STABLE (post-built)
    bleedperl  on Mac OS X v10.3.5 (integrally built w/ whole perl dist)
               on FreeBSD 4.10-STABLE (integrally built)

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.08.tar.gz or CPAN near you

=head1 Changes

$Revision: 2.8 $ $Date: 2004/10/24 13:00:29 $
! Encode.xs lib/Encode/Encoding.pm Unicode/Unicode.{pm,xs}
  Resolved the Encode::utf8 fallbacks vs. PerlIO::encoding issue
  that was introduced in 2.07. This is done by making use of the
  ->renew() method, which used to be used only by Encode::Unicode.
  A ->renewed() method was also introduced to fetch the value thereof.
  Message-Id: [EMAIL PROTECTED]

=head1 Epilogue

Enjoy!

Dan the Encode Maintainer
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
On Oct 25, 2004, at 03:01, Nick Ing-Simmons wrote:
> But as Dan said at the start, \xF6 on its own (say as the 1023rd octet in a 0..1023 1024-octet buffer) is not a fail. Changing that will make the :encoding() layer have problems, as buffer boundaries can occur in the middle of characters.

Right. Encode-2.07 indeed had the problem, causing bleedperl to fail on ext/PerlIO/t/encoding.t, test 14. Encode-2.08 corrected the problem by checking whether the caller is PerlIO and, if so, setting Encode::RETURN_ON_ERR so that it breaks out of the loop in the partial character case.

I believe I have tested enough, but I would appreciate it if you guys took a look, especially at Encode.xs and t/fallback.t.

Dan the Encode Maintainer
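The buffer-boundary problem described here also shows up in user code that decodes a stream chunk by chunk. A sketch of the usual pattern with a current Encode, using Encode::FB_QUIET so that decode() returns what it could process and leaves the unprocessed tail in the buffer (the function name decode_chunks is made up for the example):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode a sequence of octet chunks whose boundaries may fall in the
# middle of a UTF-8 character. With FB_QUIET, decode() stops at the first
# octet it cannot process and overwrites $pending with the remainder, so
# a partial trailing character is simply carried over to the next chunk.
sub decode_chunks {
    my (@chunks) = @_;
    my ($pending, $text) = ("", "");
    for my $chunk (@chunks) {
        $pending .= $chunk;
        $text .= decode("UTF-8", $pending, Encode::FB_QUIET);
    }
    return ($text, $pending);
}

# "\xC3\xA9" (e with acute) is split across the two chunks
my ($text, $rest) = decode_chunks("caf\xC3", "\xA9!");
printf "%d characters decoded, %d octets left over\n", length($text), length($rest);
```

Genuinely malformed input never gets consumed by this loop and stays in the leftover buffer, so a real reader should check the remainder at end of input.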
[Encode] 2.06 Released
Porters,

I just updated Encode to version 2.06.

=head1 Availability

http://www.dan.co.jp/~dankogai/cpan/Encode-2.06.tar.gz or CPAN near you

=head1 Changes

$Revision: 2.6 $ $Date: 2004/10/22 06:23:11 $
! ucm/mac*
  RT #8083 reports that the MacThai mapping was obsolete.
  Updated all mac* encodings according to the URI below.
  One remaining mystery is MacRomanian vs. MacRumanian.
  MacRumanian is not found in unicode.org...
  http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/
! Encode.pm t/Encode.t
  Fixed RT #8081: decode(..., bless{},'x') segfault
  Two more tests added to test that.
  http://rt.cpan.org/NoAuth/Bug.html?id=8081
! Encode.pm
  POD revised according to RT #7966
  http://rt.cpan.org/NoAuth/Bug.html?id=7966
! Unicode/Unicode.pm
  POD updated explaining why Encode::Unicode always croaks on error
  rather than giving users choices.
  http://rt.cpan.org/NoAuth/Bug.html?id=7892

=head1 Signature

Dan the Encode Maintainer
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:
> No, you misread the bug report, I expect that
>
>     perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>     perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
>
> behave the same in that the malformed sequence \xF6 gets replaced by U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT. Encode::utf8::decode_xs() fails to do that for the reason outlined in my bug report so the current result is

\xF6 ALONE does not mean that the sequence is malformed. Try

    perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C

Though unicode.org does not assign any character to U+180000 (yet), \xF6\x80\x80\x80 is a valid UTF-8 character from perl's point of view. Perl only finds it corrupted when it reaches the following 'r'. In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6? Or the following 'r'? Or 3 more octets? (FYI, that's what \xF6 suggests from UTF-8's point of view.)

>     Bj
>     Bj\x{FFFD}rnx
>
> it should be
>
>     Bj\x{FFFD}rn
>     Bj\x{FFFD}rnx

So you can't really say which behavior is correct.

> I fail to see what this has to do with how Perl treats the string as from a Perl perspective there is no real difference here, Perl works as expected, decode() does not. (I've posted this to RT but it again does not show up there, see http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).

IMHO the current implementation is correct, since you can't really tell whether the sequence is corrupted just by looking at a given octet. At the same time, I believe this should be documented somehow, somewhere.

Dan the Encode Maintainer
Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars
On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
> C12a in Unicode 4.0.1 notes
>
> [...]
> For example, in UTF-8 every code unit of the form 110xxxxx must be
> followed by a code unit of the form 10xxxxxx. A sequence such as
> 110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
> faced with this ill-formed code unit sequence while transforming or
> interpreting text, a conformant process must treat the first code
> unit 110xxxxx as an illegally terminated code unit sequence--for
> example, by signaling an error, filtering the code unit out, or
> representing the code unit with a marker such as U+FFFD
> [...]

[snip]

Okay, you win. You have convinced me that Encode::utf8 should behave the same as Encode::XS (UCM-based encodings). And the patch to make it behave that way is deceptively simple, as follows:

===
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs 2004/05/16 20:55:15 2.0
+++ Encode.xs 2004/10/22 18:00:29
@@ -297,7 +297,7 @@
             U8 skip = UTF8SKIP(s);
             if ((s + skip) > e) {
                 /* Partial character - done */
-                break;
+                goto decode_utf8_fallback;
             }
             else if (is_utf8_char(s)) {
                 /* Whole char is good */
@@ -313,6 +313,7 @@
             /* Invalid start byte */
         }
         /* If we get here there is something wrong with alleged UTF-8 */
+    decode_utf8_fallback:
         if (check & ENCODE_DIE_ON_ERR){
             Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
             XSRETURN(0);
===

The most decisive comment of yours is this:

> holds true and I expect that
>
>   my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
>   decode("utf-8", $x, Encode::FB_CROAK);
>
> croaks.

Which apparently it did not. Thank you for being so persistent on this problem. I'd be honored to add your name to the AUTHORS file for this.

I will $Encode::VERSION++ as soon as I am done w/ the test suites and Tels' patch. This time I will be careful not to screw up (maint|blead)perl, so give me some time before the update is ready (but I won't keep you waiting for too long since the 5.8.6 deadline is soon).

> Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8
> is documented as
>
> [...]
> is_utf8(STRING [, CHECK])
>   [INTERNAL] Tests whether the UTF-8 flag is turned on in the
>   STRING. If CHECK is true, also checks the data in STRING for
>   being well-formed UTF-8. Returns true if successful, false
>   otherwise.
> [...]
>
> And D36 in Unicode 4.0.1 is very clear that
>
> [...]
> As a consequence of the well-formedness conditions specified in
> Table 3-6, the following byte values are disallowed in UTF-8:
> C0-C1, F5-FF.
> [...]

That's because perl's notion of Unicode is broader than that of unicode.org. So far Unicode.org's mapping only spans from U+ to U+1f, while that of perl is U+ or even U+ (in other words, MAX_UINT). See Camel 3 for details. And I think we can leave this :)

Dan the Encode Maintainer
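The behavior requested in the quoted comment can be checked with a few lines. This is a sketch, assuming a current Encode in which the strict "utf-8" decoder treats a truncated sequence as malformed; try_decode and code_points are helper names invented here:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode UTF-8 octets with the given CHECK value.
# Returns the decoded string, or undef if decode() croaked.
sub try_decode {
    my ($octets, $check) = @_;
    my $copy = $octets;    # decode() may modify its source when CHECK is set
    my $text = eval { decode("utf-8", $copy, $check) };
    return $text;          # eval yields undef when decode() croaks
}

# Render a string as code points so that U+FFFD is visible in plain ASCII.
sub code_points {
    my ($text) = @_;
    return join " ", map { sprintf "U+%04X", ord($_) } split //, $text;
}

for my $octets ("Bj\xF6rn", "Bj\xF6r", "Bj\xF6") {
    my $strict  = try_decode($octets, Encode::FB_CROAK);
    my $lenient = try_decode($octets, Encode::FB_DEFAULT);
    printf "%d octets: FB_CROAK %s; FB_DEFAULT gives %s\n",
        length($octets),
        defined $strict ? "returned" : "croaked",
        code_points($lenient);
}
```

All three inputs end in, or contain, a lead byte that is never completed, which is the case the patch above routes to the fallback code.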
[Encode] 2.07 Released
Porters,

On Oct 22, 2004, at 15:31, Dan Kogai wrote:
> I just updated Encode to version 2.06.

Within less than 24hrs I resorted to releasing version 2.07. What the heck. 5.8.6 is soon.

=head1 Availability

  http://www.dan.co.jp/~dankogai/cpan/Encode-2.07.tar.gz
  or CPAN near you

=head1 Changes

$Revision: 2.7 $ $Date: 2004/10/22 19:35:52 $
! lib/Encode/Encoding.pm
  Remove Carp from warnings.pm that influences Encode, by Tels.
  Message-Id: [EMAIL PROTECTED]
! Encode.xs AUTHORS t/fallback.t
  Now Encode::utf8's fallbacks are compliant to Encode standard.
  Thank Bjoern Hoehrmann for persistently convincing me.
  Message-Id: [EMAIL PROTECTED]
! Encode.pm
  POD further revised.

=head1 Signature

Dan the Encode Maintainer
[Encode] 2.03 released
Porters,

I have released Encode 2.03 at last. The code was done about 10 days ago but I needed a tester to verify the 'piconv vs. Win32' case. Thanks, Steve.

=head1 Availability

  http://www.dan.co.jp/~dankogai/Encode-2.03.tar.gz
  or CPAN near you (as soon as CPAN is fixed)

=head1 Changes

$Revision: 2.3 $ $Date: 2004/10/06 05:07:20 $
! lib/Encode/Alias.pm
  Resolved some alias case sensitivity glitches reported via RT.
  http://rt.cpan.org/NoAuth/Bug.html?id=7835
! bin/piconv
  Resolved Win32 glitches reported via RT.
  (Fixed by dankogai and tested by Steve Hay)
  http://rt.cpan.org/Ticket/Display.html?id=7831
! JP/JP.pm lib/Encode/Alias.pm lib/Encode/Supported.pod AUTHORS
  /\bwindows-31j$/i is now an alias of CP932, by Steve Hay.
  http://rt.cpan.org/NoAuth/Bug.html?id=6695

Yours,
Dan the Encode Maintainer
Re: [Encode] Request for testing: piconv vs. Win32
On Oct 06, 2004, at 02:12, Steve Hay wrote:
> I created the attached "in" file (utf-8) and ran the command-line in
> the original bug report:
>
>   piconv -f utf-8 -t UTF-16LE < in > out
>
> This produces the attached "out" file, which looks right to me. I
> also saved the "in" file to another name in UTF-16LE format using
> Windows' Notepad program and the file that it output was identical
> to the "out" file produced by piconv.
>
> So that's a thumbs-up from me. I'm running Encode 2.02 with perl
> 5.8.5 on Windows XP.

Thanks a meg. I'll $Encode::VERSION++ right away.

Dan the Encode Maintainer
[Encode] Request for testing: piconv vs. Win32
Porters,

Can somebody w/ Win32 access test the patch below, mentioned in RT #7831: "piconv with ascii-incompatible output breaks on Win32"?

  http://rt.cpan.org/NoAuth/Bug.html?id=7831

I have submitted the patch but there was no response from the reporter, and I do not have access to Win32 platforms right now.

Dan the Encode Maintainer

>   perl -MEncode -e "local$/;$_=<>;binmode(STDOUT);Encode::from_to($_,'utf-8','UTF-16LE');print" < in > out

This should be resolved by applying binmode to the input filehandle. The patch below should fix that. Would you try it and tell me what happens?

Dan the Maintainer Thereof

--- bin/piconv 2004/05/16 20:55:16 2.0
+++ bin/piconv 2004/09/30 18:40:17
@@ -1,5 +1,5 @@
 #!./perl
-# $Id: piconv,v 2.0 2004/05/16 20:55:16 dankogai Exp dankogai $
+# $Id: piconv,v 2.0 2004/05/16 20:55:16 dankogai Exp $
 #
 use 5.8.0;
 use strict;
@@ -52,25 +52,39 @@
 EOT
 }
 
-# default
-if ($scheme eq 'from_to'){
-while(<>){
-    Encode::from_to($_, $from, $to, $Opt{check}); print;
-};
-# step-by-step
-}elsif ($scheme eq 'decode_encode'){
-    while(<>){
-    my $decoded = decode($from, $_, $Opt{check});
-    my $encoded = encode($to, $decoded);
-    print $encoded;
-};
-# NI-S favorite
-}elsif ($scheme eq 'perlio'){
-binmode(STDIN, ":encoding($from)");
-binmode(STDOUT, ":encoding($to)");
-while(<>){ print; }
-} else { # won't reach
-die "$name: unknown scheme: $scheme";
+# we do not use <> (or ARGV) for the sake of binmode()
+@ARGV or push @ARGV, \*STDIN;
+
+unless ($scheme eq 'perlio'){
+binmode STDOUT;
+for my $argv (@ARGV){
+    my $ifh = ref $argv ? $argv : undef;
+    $ifh or open $ifh, "<", $argv or next;
+    binmode $ifh;
+    if ($scheme eq 'from_to'){ # default
+        while(<$ifh>){
+            Encode::from_to($_, $from, $to, $Opt{check});
+            print;
+        }
+    }elsif ($scheme eq 'decode_encode'){ # step-by-step
+        while(<$ifh>){
+            my $decoded = decode($from, $_, $Opt{check});
+            my $encoded = encode($to, $decoded);
+            print $encoded;
+        }
+    } else { # won't reach
+        die "$name: unknown scheme: $scheme";
+    }
+}
+}else{
+# NI-S favorite
+binmode STDOUT => "raw:encoding($to)";
+for my $argv (@ARGV){
+    my $ifh = ref $argv ? $argv : undef;
+    $ifh or open $ifh, "<", $argv or next;
+    binmode $ifh => "raw:encoding($from)";
+    print while(<$ifh>);
+}
 }
 
 sub list_encodings{
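Why binmode matters for this patch can be seen from the octets that the conversion produces. A small sketch (not taken from piconv itself) that converts a short string to UTF-16LE with Encode::from_to and dumps the result:

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $octets = "test\n";                    # UTF-8 input (plain ASCII here)
from_to($octets, "utf-8", "UTF-16LE");    # converts $octets in place

# Without binmode, a text-mode handle on Win32 would expand the lone
# 0x0A octet to 0x0D 0x0A and break the 2-octet alignment of UTF-16LE.
binmode STDOUT;
print join(" ", map { sprintf "%02x", $_ } unpack("C*", $octets)), "\n";
```

The newline becomes the pair 0a 00, so any layer that rewrites 0x0A on its own corrupts every character after it. The same reasoning applies to the input handle, which is what the patch addresses.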
[Encode] 2.00 released!
Porters,

I have just released Encode version 2.00. Though the major version has been incremented, there are no big feature (addition|change)s.

=head1 AVAILABILITY

  http://www.dan.co.jp/~dankogai/Encode-2.00.tar.gz
  or CPAN near you

=head1 CHANGES

$Revision: 2.0 $ $Date: 2004/05/16 20:55:15 $
* version updated to 2.00
  -- sorry, no big feature change. I just hate version 1.100 :)
! lib/Encode/Guess.pm Unicode/Unicode.pm
  addressed UTF-(8|32LE) + BOM misguessing
  https://rt.cpan.org/Ticket/Display.html?id=6279
! Encode.pm
  s/is_utif8/is_utf8/ in POD
! Encode/lib/Encode/CN/HZ.pm
  Fixes make test failure after the patch to pp_hot.c by Sadahiro-san
  Message-Id: [EMAIL PROTECTED]
! bin/piconv
  From: [EMAIL PROTECTED]
  Subject: [PATCH] piconv -C 512 badly broken
  Message-Id: [EMAIL PROTECTED]

Some of the changes are already committed in Perl 5.8.[34] and maintperl, but without new releases older perls are left behind, so I released.

Enjoy!

Dan the Encode Maintainer
Re: UTF8 behavior under -T (Taint) mode
On Jan 01, 2004, at 12:32, Masanori HATA wrote:
> Hello, I have a simple question:
>
> It seems that utf8::decode() does not work for any tainted variables
> under the -T (Taint) mode. Is it right?

Wrong. What drove you to such a conclusion? It does work. Try something like

  perl -T -le 'utf8::decode($ARGV[0])' something

and see it for yourself. Did perl die with an "Insecure ..." message?

Dan the Perl5 Porter
Re: UTF8 behavior under -T (Taint) mode
On Jan 01, 2004, at 21:49, Masanori HATA wrote:
> Sorry, no. The case I would like to point out is not fatal. Perl
> would not die, but it would take the tainted value as a non-UTF8
> string. My sample code is like below (test.pl):
>
> -
> utf8::decode(my $text0 = "\x{3042}");      # clean
> utf8::decode(my $arg   = $ARGV[0]);        # tainted
> utf8::decode(my $text1 = "$arg$text0");    # tainted
> utf8::decode(my $text2 = "$text0$arg");    # tainted
> print length($text1), "\n";
> print length($text2), "\n";
> -

Aha! I see your point at last. And I found your argument was correct.

> When I run this code with 'perl -T test.pl a', the result is:

To make your point clearer, I have modified your script with Devel::Peek. Pay attention to the $text1 result.

without -T:

  % perl test.pl a
  SV = PV(0x812354) at 0x80a960
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
    PV = 0x428090 "a\343\201\202"\0 [UTF8 "a\x{3042}"]
    CUR = 4
    LEN = 5
  2
  SV = PV(0x812e10) at 0x80f2a8
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
    PV = 0x405150 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
    CUR = 4
    LEN = 5
  2

with -T:

  % perl -T test.pl a
  SV = PVMG(0x819a88) at 0x80a954
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK)
    IV = 0
    NV = 0
    PV = 0x428540 "a\343\201\202"\0
    CUR = 4
    LEN = 5
    MAGIC = 0x405480
      MG_VIRTUAL = &PL_vtbl_taint
      MG_TYPE = PERL_MAGIC_taint(t)
      MG_LEN = 1
  4
  SV = PVMG(0x819af4) at 0x80f69c
    REFCNT = 1
    FLAGS = (PADBUSY,PADMY,GMG,SMG,pPOK,UTF8)
    IV = 0
    NV = 0
    PV = 0x4054e0 "\343\201\202a"\0 [UTF8 "\x{3042}a"]
    CUR = 4
    LEN = 5
    MAGIC = 0x4010d0
      MG_VIRTUAL = &PL_vtbl_taint
      MG_TYPE = PERL_MAGIC_taint(t)
      MG_LEN = 1
  2

Under -T, $text1 loses its UTF8 flag and its length is reported as 4 instead of 2. I am not sure how severe it is, but this is a bug indeed.

> (My system is perl5.8.1 MSWin32-X86-multi-thread)

I have duplicated the result with Perl 5.8.2 on Mac OS X as well as [EMAIL PROTECTED] on FreeBSD. Using Encode::decode_utf8 does not help either, because it simply calls utf8::decode. And you can't use Encode::decode("utf8", ...) in this particular case because Encode::decode() checks its argument and croaks with "Cannot decode string with wide characters".

Hmm

Dan the Perl5 Porter
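The symptom can also be checked without Devel::Peek by looking at the UTF8 flag and the character length directly. The sketch below is a variation on the test script, not the original: it spells U+3042 as UTF-8 octets, and the helper name report is invented here. It shows the expected, non-buggy result; reproducing the bug itself would presumably need -T on one of the 5.8.x perls named above.

```perl
use strict;
use warnings;

# Returns (character length, UTF8 flag as 0/1) for a string.
sub report {
    my ($string) = @_;
    return (length $string, utf8::is_utf8($string) ? 1 : 0);
}

my $text0 = "\xE3\x81\x82";                       # UTF-8 octets of U+3042
utf8::decode($text0);
my $arg = defined $ARGV[0] ? $ARGV[0] : "a";      # tainted when run with -T
utf8::decode($arg);

for my $joined ("$arg$text0", "$text0$arg") {
    my ($length, $flag) = report($joined);
    printf "length=%d utf8_flag=%d\n", $length, $flag;
}
```

With the argument "a", a correct perl reports a length of 2 with the flag on for both orders of concatenation.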
Re: [PATCH] piconv -C 512 doesn't work.
On Friday, Sep 26, 2003, at 23:18 Asia/Tokyo, Autrijus Tang wrote:
> It is unfortunate that this very handy utility breaks:
>
>   $ piconv -f utf8 -t ascii -C 1024
>   Can't open 1024: No such file or directory at piconv line 60.

Thanks, applied in my repository.

Dan the Encode Maintainer
Re: Inverse of /\p{script}/
On Thursday, Aug 28, 2003, at 23:16 Asia/Tokyo, [EMAIL PROTECTED] wrote:
> Does the existing perl5.8.* Unicode support have a way to efficiently
> determine which script(s) or block (in unicode sense) a code point
> belongs to?
>
> In Unicode-aware Tk I am still doing battle with mechanism to select
> X11 font to display a particular codepoint (for now glossing over
> glyph vs character issues). The present code is still rather dumb.

That's what Encode::InCharset is for. Available via CPAN.

  http://search.cpan.org/author/DANKOGAI/Encode-InCharset-0.03/

> It seems to make sense to have a hash which maps script names to
> probable (font) encodings
>
>   (Hiragana | Katakana | Han) => 'jisx0208.1990-0'

The module makes it \p{InJIS0208} ...

>   (Greek) => 'iso8859-7',

And \p{InISO_8859_7}, respectively.

> So given a (1 character) string how do I get the Unicode
> script/block it is in?

One caveat, however. The module is slightly out of sync w/ the latest Encode. You should stay away from the vendor encodings that were thoroughly revised in Encode 1.75 - 1.98 (FYI Encode::InCharset is still based upon 1.75).

Dan the Encode Maintainer
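For the literal question (which script or block a code point belongs to), the core module Unicode::UCD offers charscript() and charblock(). This is a different route from the Encode::InCharset answer given above, shown only as a sketch; the mapping hash simply restates the two examples from the question, and font_encoding_for is a made-up helper name.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript charblock);

# Script name => X11 font encoding, as proposed in the question above.
my %FONT_ENCODING_FOR_SCRIPT = (
    Hiragana => 'jisx0208.1990-0',
    Katakana => 'jisx0208.1990-0',
    Han      => 'jisx0208.1990-0',
    Greek    => 'iso8859-7',
);

# Hypothetical helper: pick a font encoding for a 1-character string.
# Returns undef when the script is unknown or has no entry in the hash.
sub font_encoding_for {
    my ($char) = @_;
    my $script = charscript(ord $char);
    return defined $script ? $FONT_ENCODING_FOR_SCRIPT{$script} : undef;
}

for my $char ("\x{3042}", "\x{30A2}", "\x{03B1}") {
    my $cp = ord $char;
    printf "U+%04X script=%s block=%s encoding=%s\n",
        $cp, charscript($cp), charblock($cp), font_encoding_for($char);
}
```

Block names returned by charblock() have changed between Unicode versions, so code that keys on them should probably prefer the script name.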
Re: [Patch] Encode.pm : euro sign missing in cp936.ucm
SADAHIRO-san and cp9?? experts,

On Thursday, Mar 27, 2003, at 00:44 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
> +<U20AC> \x80 |0 # EURO SIGN

Is this right? Yes, <U20AC> is indeed missing from cp936.ucm, but see this:

  % grep U20AC ucm/cp*.ucm
  /Users/dankogai/work/Encode/ucm/cp1250.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1251.ucm:<U20AC> \x88 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1252.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1253.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1254.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1255.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1256.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1257.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp1258.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp874.ucm:<U20AC> \x80 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp949.ucm:<U20AC> \xA2\xE6 |0 # EURO SIGN
  /Users/dankogai/work/Encode/ucm/cp950.ucm:<U20AC> \xA3\xE1 |0 # EURO SIGN

\x80 SEEMS right for single-byte CPs, but the euro sign is mapped differently in CP949 and CP950. So far as I can tell from Microsoft's pages

  http://www.microsoft.com/typography/unicode/cscp.htm
  -> http://www.microsoft.com/globaldev/reference/wincp.mspx
  -> http://www.microsoft.com/globaldev/reference/dbcs/936.htm

cp936 indeed does use \x80 (though only \x00-\xFF are covered; where the heck is the FULL MAP!?). But it seems this only applies to 936. 932 (Japanese; Shift_JIS based), 949 (Korean; euc-kr based) and 950 (Traditional Chinese; Big5-based) all leave \x80 blank.

I would like more confirmation from experts. cp936.ucm was overhauled with the help of MORIYAMA-san, and at that time the FULL map was available from the URIs above. And I think \x80 was not used for EURO SIGN back then.

Oh, I still have a copy of the full mapping that was once available via the URI above. Let's see... cp936.txt says...

  CODEPAGE 936 ; PRC GBK (XGB) - ANSI, OEM
  CPINFO 2 0x3f 0x003f ; DBCS CP, Default Char = Question Mark
  MBTABLE 130
  0x00 0x0000 ;Null
  [snip]
  0x20 0x0020 ;Space
  [snip]
  0x7f 0x007f ;^?
  0x80 0x0080 ;80
  0xff 0xf8f5 ;FF

\x80 is mentioned but not mapped to EURO SIGN. Please, somebody tell me where to find the FULL map.

Dan the Encode Maintainer with Too Many (Dead) Links to Follow
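The grep above can be repeated against whatever Encode is installed by asking each encoding for its EURO SIGN octets. A sketch, with euro_octets as an invented helper name; what it prints for cp936 depends on the installed mapping, which is exactly the open question in this message:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Returns the octets for U+20AC in the given encoding as a "\xNN" string,
# or undef if the encoding has no mapping for it (or is not available).
sub euro_octets {
    my ($encoding) = @_;
    my $euro   = "\x{20AC}";    # a variable, since encode() may modify it
    my $octets = eval { encode($encoding, $euro, Encode::FB_CROAK) };
    return undef unless defined $octets;
    return join "", map { sprintf "\\x%02X", $_ } unpack "C*", $octets;
}

for my $encoding (qw(cp1251 cp1252 cp936 cp949 cp950)) {
    my $octets = euro_octets($encoding);
    printf "%-7s %s\n", $encoding, defined $octets ? $octets : "(not mapped)";
}
```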
Re: Warning messages for ill-formed data
Autrijus (and Porters),

I think you are following this thread, but in case you are not: Sadahiro-san proposes that some extraneous (and presumably unneeded) control characters in \x80-\xA0 in the big5-eten map be removed, to solve problems that arise in certain circumstances. Since these control characters are just duplicates of those at \x00-\x20, I think it is a good idea to go for it (and do the same to big5-hkscs.ucm). But I am not as sure of Big5 as you are, so please check whether the proposal is right. If you affirm the idea, I'll $Encode::VERSION++.

Dan the Encode Maintainer

On Tuesday, Mar 25, 2003, at 21:53 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
> Well, is it right? I'm not sure of the status and the single
> byte-range for Big-5, though.
>
> diff -urN ucm~/big5-eten.ucm ucm/big5-eten.ucm
> --- ucm~/big5-eten.ucm Thu Jan 23 23:21:00 2003
> +++ ucm/big5-eten.ucm Tue Mar 25 21:43:00 2003
> @@ -137,38 +137,6 @@
>  <U007E> \x7E |0 # TILDE
>  <U007F> \x7F |0 # DELETE
>  <U0080> \x80 |0 # control
> -<U0081> \x81 |0 # control
> -<U0082> \x82 |0 # BREAK PERMITTED HERE
> -<U0083> \x83 |0 # NO BREAK HERE
> -<U0084> \x84 |0 # control
> -<U0085> \x85 |0 # NEXT LINE
> -<U0086> \x86 |0 # START OF SELECTED AREA
> -<U0087> \x87 |0 # END OF SELECTED AREA
> -<U0088> \x88 |0 # CHARACTER TABULATION SET
> -<U0089> \x89 |0 # CHARACTER TABULATION WITH JUSTIFICATION
> -<U008A> \x8A |0 # LINE TABULATION SET
> -<U008B> \x8B |0 # PARTIAL LINE DOWN
> -<U008C> \x8C |0 # PARTIAL LINE UP
> -<U008D> \x8D |0 # REVERSE LINE FEED
> -<U008E> \x8E |0 # SINGLE SHIFT TWO
> -<U008F> \x8F |0 # SINGLE SHIFT THREE
> -<U0090> \x90 |0 # DEVICE CONTROL STRING
> -<U0091> \x91 |0 # PRIVATE USE ONE
> -<U0092> \x92 |0 # PRIVATE USE TWO
> -<U0093> \x93 |0 # SET TRANSMIT STATE
> -<U0094> \x94 |0 # CANCEL CHARACTER
> -<U0095> \x95 |0 # MESSAGE WAITING
> -<U0096> \x96 |0 # START OF GUARDED AREA
> -<U0097> \x97 |0 # END OF GUARDED AREA
> -<U0098> \x98 |0 # START OF STRING
> -<U0099> \x99 |0 # control
> -<U009A> \x9A |0 # SINGLE CHARACTER INTRODUCER
> -<U009B> \x9B |0 # CONTROL SEQUENCE INTRODUCER
> -<U009C> \x9C |0 # STRING TERMINATOR
> -<U009D> \x9D |0 # OPERATING SYSTEM COMMAND
> -<U009E> \x9E |0 # PRIVACY MESSAGE
> -<U009F> \x9F |0 # APPLICATION PROGRAM COMMAND
> -<U00A0> \xA0 |0 # NO-BREAK SPACE
>  <U00A7> \xA1\xB1 |0
>  <U00A8> \xC6\xD8 |0
>  <U00AF> \xA1\xC2 |0
> @@ -178,11 +146,6 @@
>  <U00D7> \xA1\xD1 |0
>  <U00F7> \xA1\xD2 |0
>  <U00F8> \xC8\xFB |0
> -<U00FA> \xFA |0 # LATIN SMALL LETTER U WITH ACUTE
> -<U00FB> \xFC |0 # LATIN SMALL LETTER U WITH CIRCUMFLEX
> -<U00FD> \xFD |0 # LATIN SMALL LETTER Y WITH ACUTE
> -<U00FE> \xFE |0 # LATIN SMALL LETTER THORN
> -<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
>  <U014B> \xC8\xFC |0
>  <U0153> \xC8\xFA |0
>  <U0250> \xC8\xF6 |0
> diff -urN ucm~/big5-hkscs.ucm ucm/big5-hkscs.ucm
> --- ucm~/big5-hkscs.ucm Thu Jan 23 23:21:02 2003
> +++ ucm/big5-hkscs.ucm Tue Mar 25 21:37:10 2003
> @@ -136,13 +136,6 @@
>  <U007E> \x7E |0 # TILDE
>  <U007F> \x7F |0 # DELETE
>  <U0080> \x80 |0 # control
> -<U0081> \x81 |0 # control
> -<U0082> \x82 |0 # BREAK PERMITTED HERE
> -<U0083> \x83 |0 # NO BREAK HERE
> -<U0084> \x84 |0 # control
> -<U0085> \x85 |0 # NEXT LINE
> -<U0086> \x86 |0 # START OF SELECTED AREA
> -<U0087> \x87 |0 # END OF SELECTED AREA
>  <U00A7> \xA1\xB1 |0
>  <U00A8> \xC6\xD8 |0
>  <U00AF> \xA1\xC2 |0
> @@ -171,7 +164,6 @@
>  <U00F9> \x88\x7B |0
>  <U00FA> \x88\x79 |0
>  <U00FC> \x88\xA2 |0
> -<U00FF> \xFF |0 # LATIN SMALL LETTER Y WITH DIAERESIS
>  <U0100> \x88\x56 |0
>  <U0101> \x88\x67 |0
>  <U0112> \x88\x5A |0
>
> Regards,
> SADAHIRO Tomoyuki
>
>> I often encounter lower-ascii codes mixed in with Big5 text, which
>> is fine and straightforward to handle. However, a problem arises
>> when upper-ascii octets occasionally occur outside of the Big5
>> range. When such a character occurs, it is probably an error or
>> part of a user-defined character. However, it appears that Encode
>> DOES NOT display warnings for these but rather maps individual
>> upper-ascii octets to conventional characters such as Roman letters
>> with diacritics commonly found in European languages. (It appears
>> that Encode displays warnings for characters that are within the
>> Big5 range, but do not have a mapping to Unicode, perhaps because
>> these code points are not used in Big5 itself.)
>>
>> Is there a way to cause Encode to display warnings for upper ascii
>> outside of the Big5 range when converting from Big5 to Unicode? If
>> not, could the developers consider this for a future fix?
>>
>> Mark
Re: Warning messages for ill-formed data
On Tuesday, Mar 25, 2003, at 13:59 Asia/Tokyo, Mark Lewellen wrote:
> Is there a way to cause Encode to display warnings for upper ascii
> outside of the Big5 range when converting from Big5 to Unicode? If
> not, could the developers consider this for a future fix?

Use the optional 3rd argument to decode().

  $utf8 = decode(Big5 => $big5);
  # ill-formed chars are mapped to U+FFFD
  $utf8 = decode(Big5 => $big5, Encode::FB_WARN);
  # same but warnings issued

See "Handling Malformed Data" in perldoc Encode for how to use the 3rd argument.

Dan the Encode Maintainer
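The effect of the 3rd argument can be observed by trapping the warnings. The sketch below uses the "ascii" encoding rather than Big5 so that the result does not depend on which Big5 mapping is installed; decode_collecting_warnings is an invented helper name. Note that perldoc Encode describes FB_WARN as FB_QUIET plus a warning, so decoding is expected to stop at the first offending octet rather than substitute U+FFFD and carry on.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode $octets and collect any warnings instead of printing them.
# Returns the decoded text and an array reference of warning messages.
sub decode_collecting_warnings {
    my ($encoding, $octets, $check) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $copy = $octets;    # decode() may modify its source when CHECK is set
    my $text = decode($encoding, $copy, $check);
    return ($text, \@warnings);
}

my $octets = "a\xE9b";     # \xE9 is not valid in ASCII

my ($default_text, $default_warnings) =
    decode_collecting_warnings("ascii", $octets, Encode::FB_DEFAULT);
my ($warn_text, $warn_warnings) =
    decode_collecting_warnings("ascii", $octets, Encode::FB_WARN);

printf "FB_DEFAULT: %d chars decoded, %d warning(s)\n",
    length($default_text), scalar @$default_warnings;
printf "FB_WARN:    %d chars decoded, %d warning(s)\n",
    length($warn_text), scalar @$warn_warnings;
print "  $_" for @$warn_warnings;
```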
Re: Encode 1.87 and later don't pass make test on static perl
Sorry, my finger has slipped.

On Thursday, Mar 6, 2003, at 22:48 Asia/Tokyo, Dan Kogai wrote:
On Thursday, Mar 6, 2003, at 04:10 Asia/Tokyo, Blair Zajac wrote:
> Hello,
>
> I have several self compiled copies of Perl 5.8.0, one of which is
> compiled to be statically linked (for Perl modules that is, not libc
> and other system libraries) so I can profile the code using gcc -pg.

How EXACTLY did you do so? Here is a recommended way.

  0) rm -rf perl-5.8.0/ext/Encode
  1) untargzip Encode-1.xx and mv that to perl-5.8.0/ext/Encode
  2) update perl-5.8.0/MANIFEST
  3) configure perl
  4) make and make test

Maybe you should start over w/ a fresh copy of perl-5.8.0 as well.

> I looked for Encode 1.87 on CPAN but couldn't find it, so here's the
> error output with 1.89, but the problem was introduced in 1.87.
> Perl -V output is below. If I can find tar.gz's of Encode 1.86 and
> 1.87, I could check again to ensure that 1.86 and 1.87 work and fail
> respectively.

Andreas has already answered this one (Thanks, Andreas).

The reason I am asking whether you are following the right procedure is this:

> Encode::KR object version 1.22 does not match bootstrap parameter
> 1.23 at /opt/i386-linux/installed/perl-5.8.0-g-pg/lib/5.8.0/i686-linux/XSLoader.pm
> line 44.

Here a module version mismatch is happening, suggesting that old symbol(s) still exist somewhere...

Dan the Encode Maintainer
Re: [PATCH] viscii.ucm
SADAHIRO Tomoyuki [EMAIL PROTECTED] wrote:
> I doubt whether the Unicode consortium has ever provided any
> viscii-Unicode mapping table under www.unicode.org/Public/MAPPINGS/.
> I suppose the table shipped with Perl was borrowed from czyborra.com.
> The table there ( http://czyborra.com/charsets/vietnamese.html )
> wrongly has a duplicated A^? and no a^?.
>
> The following site provides a correct table.
> http://www.vietstd.org/document/unicode.html

Okay. I'll replace viscii.ucm in the next release.

On Sunday, Feb 16, 2003, at 20:10 Asia/Tokyo, Jarkko Hietaniemi wrote:
> ...or it could come from the Tcl/Tk mapping tables?

I think this is it. When I took over Encode maintenance, I rebuilt whatever mappings unicode.org did have, but I don't recall doing so for viscii, so it must have come from viscii.enc.

Dan the Encode Maintainer
Re: [PATCH] viscii.ucm
On Sunday, Feb 16, 2003, at 12:26 Asia/Tokyo, SADAHIRO Tomoyuki wrote:
> Hello.
>
> I've found that the mapping in the present viscii.ucm differs a bit
> from that in RFC 1456. U+1EA8 (A^? in VIQR) should be mapped to 0x86,
> and U+1EA9 (a^? in VIQR) to 0xA6.
>
> Here is a test scratch to check whether VISCII indeed supports all
> the Latin extensions for Vietnamese, i.e. U+1EA0..U+1EF9.
>
>   #!perl
>   use Encode;
>   for (my $u = 0x1EA0; $u <= 0x1EF9; $u++) {
>       # FB_WARN issues a warning for each code point VISCII cannot encode
>       my $e = encode("VISCII", chr $u, Encode::FB_WARN);
>   }
>   warn "End.";
>   __END__

Thanks. Applied in my repository.

Dan the Encode Maintainer

P.S. Is ftp.funet.fi still down? I think I am ready to $Encode::VERSION++ with this patch applied.
Re: Handling MacArabic in perl 5.8.0
David and Sadahiro-san,

On Wednesday, January 29, 2003, at 11:58 PM, SADAHIRO Tomoyuki wrote:
> On Tue, 28 Jan 2003 01:48:42 -0500
> David Graff [EMAIL PROTECTED] wrote:
>> BTW, I just noticed that the unicode web site now has a more recent
>> version of the APPLE/ARABIC.TXT mapping page than the one I cited
>> earlier, and the new version offers improved/expanded commentary:
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ARABIC.TXT
>> (dated Dec. 19, 2002).

Oh shoot. Should I rebuild Encode/ucm/macArabic.ucm? FYI I am a Mac user, but I am on OS X, so I don't have many chances to come across mac* encodings.

Dan the Encode Maintainer
[Encode] HEADS-UP; $Encode::VERSION++ to enhance filter option
Porters,

In recent discussion on various Japanese perl-related MLs, I have discovered that the encoding pragma does not work with multibyte encodings such as Shift_JIS, which use the 0x00-0x7f range in the 2nd byte. Though I have not tested it, I am pretty sure big5 is also prone to this. To understand the problem, please have a look at the hexdump below:

  % hexdump -C enc-sjis.pl
  00000000  23 2f 75 73 72 2f 6c 6f 63 61 6c 2f 62 69 6e 2f  |#/usr/local/bin/|
  00000010  70 65 72 6c 20 2d 77 0a 75 73 65 20 73 74 72 69  |perl -w.use stri|
  00000020  63 74 3b 0a 75 73 65 20 65 6e 63 6f 64 69 6e 67  |ct;.use encoding|
  00000030  20 27 73 68 69 66 74 2d 6a 69 73 27 3b 0a 0a 6d  | 'shift-jis';..m|
  00000040  79 20 24 6e 61 6d 65 20 3d 20 22 94 5c 22 3b 0a  |y $name = ".\";.|
  00000050  70 72 69 6e 74 20 24 6e 61 6d 65 3b 0a 77 72 69  |print $name;.wri|
  00000060  74 65 3b 0a 0a 66 6f 72 6d 61 74 20 53 54 44 4f  |te;..format STDO|
  00000070  55 54 20 3d 0a 94 5c 97 cd 3a 40 3c 3c 3c 0a 24  |UT =..\..:@<<<.$|
  00000080  6e 61 6d 65 0a 2e 0a                             |name...|

The script is a valid perl script in Shift_JIS, but the quoted character (U+80fd, \x94\x5c in Shift_JIS) has \x5c as its 2nd byte. Perl reads that byte as a backslash, which escapes the closing quote and mangles the script. The source must be parsable ASCII-wise for the encoding pragma to work.

Fortunately, the encoding pragma offers a different approach via Filter=>1. The problem is that the Filter option was incomplete in two ways.

0. Filter=>1 leaves STD(IN|OUT) untouched. Not only does it leave STD* untouched, it completely ignores the STD*=> hooks that the non-filter version offers.

1. In order to touch STD(IN|OUT) sensibly you have to 'use utf8' in the script to make sure the literals therein are utf8-flagged, but that makes the code too counterintuitive.

The following patch fixes that so the filter option is more useful. I am planning to apply this patch to the next version of Encode, but I still need to fix the POD and write test suites. So I decided to issue a warning before committing a release.

Dan the Encode Maintainer

--- encoding.pm 2003/01/22 03:29:07 1.40
+++ encoding.pm 2003/01/26 07:03:59
@@ -35,33 +35,11 @@
     unless ($arg{Filter}) {
         ${^ENCODING} = $enc unless $] <= 5.008 and $utfs{$name};
         $HAS_PERLIO or return 1;
-        for my $h (qw(STDIN STDOUT)){
-            if ($arg{$h}){
-                unless (defined find_encoding($arg{$h})) {
-                    require Carp;
-                    Carp::croak("Unknown encoding for $h, '$arg{$h}'");
-                }
-                eval { binmode($h, ":encoding($arg{$h})") };
-            }else{
-                unless (exists $arg{$h}){
-                    eval {
-                        no warnings 'uninitialized';
-                        binmode($h, ":encoding($name)");
-                    };
-                }
-            }
-            if ($@){
-                require Carp;
-                Carp::croak($@);
-            }
-        }
     }else{
         defined(${^ENCODING}) and undef ${^ENCODING};
         eval {
             require Filter::Util::Call ;
             Filter::Util::Call->import ;
-            binmode(STDIN);
-            binmode(STDOUT);
             filter_add(sub{
                 my $status;
                 if (($status = filter_read()) > 0){
@@ -71,7 +49,31 @@
                 $status ;
             });
         };
+        # internally use utf8 to make sure utf8 flags are set
+        # for literals.
+        use utf8 (); # to fetch $utf8::hint_bits;
+        $^H |= $utf8::hint_bits;
         # warn "Filter installed";
+    }
+    for my $h (qw(STDIN STDOUT)){
+        if ($arg{$h}){
+            unless (defined find_encoding($arg{$h})) {
+                require Carp;
+                Carp::croak("Unknown encoding for $h, '$arg{$h}'");
+            }
+            eval { binmode($h, ":encoding($arg{$h})") };
+        }else{
+            unless (exists $arg{$h}){
+                eval {
+                    no warnings 'uninitialized';
+                    binmode($h, ":encoding($name)");
+                };
+            }
+        }
+        if ($@){
+            require Carp;
+            Carp::croak($@);
+        }
     }
     return 1; # I doubt if we need it, though
 }
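The \x5c problem described in this message can be detected mechanically. A short sketch, using Encode's "shiftjis" encoding, that reports whether a character's 2nd Shift_JIS octet collides with the ASCII backslash (has_backslash_trail_byte is a name invented for this illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# True if the character encodes to two octets in Shift_JIS and the
# second one is 0x5C, i.e. looks like a backslash to an ASCII-wise parser.
sub has_backslash_trail_byte {
    my ($char) = @_;
    my @bytes = unpack "C*", encode("shiftjis", $char);
    return @bytes == 2 && $bytes[1] == 0x5C;
}

for my $cp (0x80FD, 0x8868, 0x3042) {
    my @bytes = unpack "C*", encode("shiftjis", chr $cp);
    printf "U+%04X -> %s %s\n",
        $cp,
        join(" ", map { sprintf "%02x", $_ } @bytes),
        has_backslash_trail_byte(chr $cp) ? "(2nd octet is a backslash)" : "";
}
```

U+80FD is the character from the hexdump above; any string literal ending in such a character is affected in the same way.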
Re: Encode utf-16 problem
On Tuesday, Dec 3, 2002, at 11:12 Asia/Tokyo, Jarkko Hietaniemi wrote:
> Why the 'Partial character' warnings? I would have thought the input
> files are just right. Also, the warnings are given to stderr
> unconditionally, I would have to redirect stderr to /dev/null to get
> rid of the warnings.
>
> $ perl -le 'print pack("v*", 0xFEFF, unpack("C*", "test"))' >! utf16
> $ hex utf16
> ff fe 74 00 65 00 73 00 74 00 0a    ..t.e.s.t..
> $ ./perl -Ilib -e 'open(FH, "<:encoding(utf16)", "utf16");$a=<FH>;print $a'|hex
> UTF-16:Partial character at -e line 1.
> UTF-16:Partial character at -e line 1.
> 74 65 73 74    test
> $ perl -le 'print pack("n*", 0xFEFF, unpack("C*", "test"))' >! utf16
> $ hex utf16
> fe ff 00 74 00 65 00 73 00 74 0a    ...t.e.s.t.
> $ ./perl -Ilib -e 'open(FH, "<:encoding(utf16)", "utf16");$a=<FH>;print $a'|hex
> UTF-16:Partial character at -e line 1.
> UTF-16:Partial character at -e line 1.
> 74 65 73 74    test
> $

Aw. You can't use 'utf16' for use encoding or PerlIO. You have to specify the endianness. Because of the BOM you can't use it for a PerlIO stream. I'll tweak Unicode.pm so that perlio_ok returns 0 for BOMless UTF's in the next version.

Dan the Encode Maintainer
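For whole strings, as opposed to PerlIO streams, both spellings can be used with decode(). A minimal sketch of the difference between the endian-specific names and plain "UTF-16", which expects a BOM at the start of the data:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same text as UTF-16 octets in three flavours.
my $le     = pack "v*", unpack("C*", "test");            # little endian, no BOM
my $be     = pack "n*", unpack("C*", "test");            # big endian, no BOM
my $bom_le = pack "v*", 0xFEFF, unpack("C*", "test");    # BOM + little endian

print decode("UTF-16LE", $le),     "\n";   # endianness given by name
print decode("UTF-16BE", $be),     "\n";
print decode("UTF-16",   $bom_le), "\n";   # endianness taken from the BOM
```

Note that the test files in the quoted session end in a single 0x0a octet, written by -l after the packed data, so they hold an odd number of octets.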
[Encode] HEADS-UP: NC patch will be in
NC and porters,

First of all, this is a great patch. Not only does it optimize the resulting shlibs, it seems to consume less memory during compilation.

On Monday, Nov 4, 2002, at 12:26 Asia/Tokyo, [EMAIL PROTECTED] wrote:
> Nicholas Clark [EMAIL PROTECTED] wrote:
> :I've been experimenting with how enc2xs builds the C tables that turn into the
> :shared objects. enc2xs is building tables (arrays of struct encpage_t) which
> :in turn have pointers to blocks of bytes.
>
> Great, you seem to be getting some excellent results.

Worked absolutely fine on my PowerBook G4, too.

Before:
   208948 Encode/Byte/Byte.bundle
  1984416 Encode/CN/CN.bundle
    30076 Encode/EBCDIC/EBCDIC.bundle
    33728 Encode/Encode.bundle
  2590420 Encode/JP/JP.bundle
  2208996 Encode/KR/KR.bundle
    39720 Encode/Symbol/Symbol.bundle
  1940288 Encode/TW/TW.bundle
    17892 Encode/Unicode/Unicode.bundle

After:
   178220 Encode/Byte/Byte.bundle
  1085116 Encode/CN/CN.bundle
    25336 Encode/EBCDIC/EBCDIC.bundle
    33604 Encode/Encode.bundle
  1308568 Encode/JP/JP.bundle
  1209804 Encode/KR/KR.bundle
    34896 Encode/Symbol/Symbol.bundle
  1059040 Encode/TW/TW.bundle
    17892 Encode/Unicode/Unicode.bundle

> I have also wondered whether the .ucm files are needed after these
> have been built; if not, we should consider supplying with perl only
> the optimised table data if that could give us a space saving in the
> distribution - it would cut build time significantly as well as
> allowing us to consider algorithms that take much longer over the
> table optimisation, since they need be run only once when we
> integrate updated .ucm files.

A trivial yet effective patch is to strip all comments therein. That should save space dramatically, but *.ucm is, in a way, source, so I am not sure if I should go for it.

Anyway, I am pretty much for integrating the NC patch, not just because it reduces shlib sizes but because it also appears safer for compilers (one of the optimizer features (AGGREGATE_TABLES) was dropped during the dev phase of perl 5.8 for the sake of djgpp and other low memory platforms).

Unfortunately I am at my parents' place this week (to finish the book I am writing -- away from kids) so I do not have as many resources for extensive tests (the FreeBSD box I was using here at my parents' just died (physically) the day before I came :-( ).

Another concern is that since it changes the internal structure of shlibs, CPANized Encode::* modules need to be rebuilt as well, so the released version needs to print a warning on that -- oh wait! Encode.xs remains unchanged so Encode::* may still work.

Thank you, NC.

Dan the Encode Maintainer
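The Before/After listing is easier to judge as percentages. A short sketch that recomputes the savings from the sizes quoted above (the numbers are copied from this message; the helper name total is arbitrary):

```perl
use strict;
use warnings;

# Bundle sizes in bytes, copied from the Before/After listing above.
my %before = (
    Byte => 208948,  CN => 1984416, EBCDIC => 30076,
    Encode => 33728, JP => 2590420, KR => 2208996,
    Symbol => 39720, TW => 1940288, Unicode => 17892,
);
my %after = (
    Byte => 178220,  CN => 1085116, EBCDIC => 25336,
    Encode => 33604, JP => 1308568, KR => 1209804,
    Symbol => 34896, TW => 1059040, Unicode => 17892,
);

# Sum of all sizes in a hash reference.
sub total {
    my ($sizes) = @_;
    my $sum = 0;
    $sum += $_ for values %$sizes;
    return $sum;
}

for my $name (sort keys %before) {
    printf "%-8s %8d -> %8d (%5.1f%% smaller)\n",
        $name, $before{$name}, $after{$name},
        100 * ($before{$name} - $after{$name}) / $before{$name};
}
printf "%-8s %8d -> %8d (%5.1f%% smaller)\n",
    "total", total(\%before), total(\%after),
    100 * (total(\%before) - total(\%after)) / total(\%before);
```

The CJK bundles account for nearly all of the reduction, while Unicode.bundle is unchanged.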
Re: [Encode] HEADS-UP: NC patch will be in
On Monday, Nov 4, 2002, at 20:11 Asia/Tokyo, Dan Kogai wrote:
> oh wait! Encode.xs remains unchanged so Encode::* may still work

Confirmed. The NC patch works w/ preexisting shlibs.

  % perl -MEncode -e 'print Encode->VERSION, "\n"'
  1.81 # not released, of course!
  % perl -MEncode::HanExtra -e 1

Dan the Encode Maintainer
How to name CJK ideographs
On Saturday, Oct 26, 2002, at 03:55 Asia/Tokyo, Jungshik Shin wrote:
> Another possibility is 'meaning-pronunciation' index. I believe this
> is one of a few ways to refer to CJK characters (say, over the phone)
> in all CJK countries. However, to do this, we need much more raw data
> (more or less like a small dictionary) than UniHan DB provides
> because it lists meanings of characters in English only.

That's one thing I wish I could do -- "Dan as in Bomb" because I can't go like "YOU five ef three ee" :)

I know that's difficult, but it strikes me that we still have no way to canonically specify (Hanzi|Kanji|Hanja) after all these years (besides Unicode code points, but who the heck wants to do so?).

perl -e 'print \x{5c0f}\x{98fc} \x{5f3e}\n;
Re: [Encode] 1.80 released
On Friday, Oct 25, 2002, at 09:29 Asia/Tokyo, [EMAIL PROTECTED] wrote:
> I'd recommend the small patch below, which will make it possible to
> run the new rt.pl in any of the standard manners under the core:
>
>   ( cd t ; ./perl TEST ../ext/Encode/t/rt.pl )
>   ( cd t ; ./perl harness ../ext/Encode/t/rt.pl )
>   PERL_CORE=1 ./perl -Ilib ext/Encode/t/rt.pl
>
> With this patch, those tests also pass (eventually :).

Thanks, applied back :) I feel relieved now. And I am doubly relieved to find how meticulous a pumpking you are. I wonder why Net::Ping's rather obvious bug (obvious for *BSD users, that is) slipped thru :)

And I was surprised to find your name was not in ext/Encode/AUTHORS. Now added.

With that done, please proceed to the next patch to fix tr///:

  From: Dan Kogai [EMAIL PROTECTED]
  Date: Mon Oct 21, 2002 17:36:02 Asia/Tokyo
  To: hv [EMAIL PROTECTED], Inaba Hiroto [EMAIL PROTECTED], Jarkko Hietaniemi [EMAIL PROTECTED]
  Cc: [EMAIL PROTECTED]
  Subject: The Inaba patch for tr/// vs. use encoding
  Message-Id: [EMAIL PROTECTED]

I KNOW you are working on it (at least reviewing it), but this is just a reminder.

Dan the Perl5 Porter
Re: Unicode. Perl does the right thing?
On Friday, Oct 25, 2002, at 14:10 Asia/Tokyo, Philip Newton wrote:
> Well, partially because there's no "good" names for many of the characters. What do you call "生"? "CJK UNIFIED IDEOGRAPH-751F"? (That's the current Unicode "name", but it's not particularly useful.) "CJK shou"? "CJK sei"? "CJK sheng1"? "CJK saeng"? "CJK ikiru"? ikasu, ikeru, umareru, umu, ou, haeru, hayasu, ki, nama, naru, nasu, musu, which one do you pick?

If we are stuck with de jure, ex officio names from the Unicode Consortium we are out of luck, but this is perl; if there is more than one way to do it, why not more than one way to name it? I am kind of wondering about a charnames extension that goes like

    use charnames ":ja"; # Japanese
    print "\N{sei-ikiru}";
    #
    use charnames ":ko";
    print "\N{saeng}";
    #
    use charnames ":zh";
    print "\N{sheng1}";

Since the pragmatic approach is rather inflexible, I would prefer an OO approach, like

    use Char::Name;

    my $char = Char::Name->new;

    print $char->jp("sei-ikiru");

I know Japanese is the biggest nightmare to name characters because in Japanese we give too many "names" to each character; it's really hard to disambiguate these.

I may come up with something as I look through the Unihan DB, now accessible via CPAN (Unicode::Unihan).

> Cheers,
> Philip Newton ($BIT0aN'ITF~FZ(B)

\x{5c0f}\x{98fc} \x{5f3e}
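
For what it's worth, the stock charnames pragma in later perls already allows a small approximation of "more than one way to name it" through its :alias option. The following is only a sketch under that assumption: the alias "sei" is a made-up name for illustration, not anything the Unihan DB or the proposed Char::Name would necessarily provide, and the helper names are hypothetical.

```perl
use strict;
use warnings;

# "sei" is a hypothetical alias chosen for this example only;
# 0x751F is the code point Philip cites above.
use charnames ":full", ":alias" => { sei => 0x751F };

# The official Unicode name, as reported by charnames::viacode().
sub official_name {
    my ($code) = @_;
    return charnames::viacode($code);
}

# The character reached through the custom alias.
sub sei_char {
    return "\N{sei}";
}

printf "U+%04X is officially named '%s'\n",
    ord( sei_char() ), official_name(0x751F);
```

An alias table like this is per lexical scope, so it does not answer the harder question raised here (one character, many readings); it only shows that the naming layer itself is pluggable.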
[Encode] 1.80 released
Hugo and porters,

I have released Encode 1.80 despite the fact that I released 1.79 less than 24 hours ago.

Whole: http://www.dan.co.jp/~dankogai/Encode-1.80.tar.gz and CPAN

The change is very small; it just includes a patch from NI-XS.

$Revision: 1.80 $ $Date: 2002/10/21 20:39:09 $
! Encode.xs t/mime-header.t
  Even more patches from NI-XS regarding Encode::utf8->decode().
  And one more test to t/mime-header.t to prove it
  Message-Id: [EMAIL PROTECTED]

Still, I decided to go for 1.80 because I reckon you are yet to commit the latest Encode to bleedperl. If you haven't worked on it yet, that's fine; just skip 1.7? and go straight to 1.80.

Apologies and thanks.

Dan the Encode Maintainer.
[Encode] 1.79 released
porters,

I have decided to release Encode 1.79 so soon after 1.78 for two reasons.

Whole: http://www.dan.co.jp/~dankogai/Encode-1.79.tar.gz and CPAN

=head1 reasons

=over

=item 1

The latest patch to Encode.(pm|xs) by Nick Ing-XS to relocate Encode::utf8 from .pm to .xs has introduced a minor bug that was revealed in t/mime-header.t. It was due to the fact that Encode::utf8->decode() attempts to decode even when the argument is already flagged as a utf8 string. Encode 1.78 fixed the problem by mending lib/Encode/MIME/Header.pm, but Nick Ing-XS has sent me a patch to Encode.xs.

=item 2

The M$ version of the mapping in cp949 (Korean) and cp950 (Trad. Chinese) was obsolete, resulting in U+20AC (EURO SIGN) and U+00AE (REGISTERED SIGN) missing. This time Moriyama-san has tested them against conversions via the Win32 API and verified that they all match now (at least those marked as round-trippable).

=back

=head2 grumbles

Frankly, I am f.*ing tired of hearing about any M$-related char map issues. This is to close the cp9?? cases altogether. From now on I will happily ignore any claims saying 'cp??? seems to be wrong' unless M$ fixes their web pages ( http://www.microsoft.com/typography/unicode/cscp.htm -- it's gone!) and THE ATTITUDE (no news on the shutdown of the page above was released to the community). I have just had enough, m'kay?

=head1 Changes

$Revision: 1.79 $ $Date: 2002/10/21 06:05:37 $
! Encode.xs
  Further patches from NI-XS. Encode::utf8->decode() now checks the
  value of the utf8 flag of the argument. As a result, the fix to
  lib/Encode/MIME/Header.pm is no longer necessary but since it did no
  harm (even speedwise) I'll leave it unreverted.
! ucm/cp949.ucm ucm/cp950.ucm
  U+20AC EURO SIGN
  U+00AE REGISTERED SIGN
  were missing as a result of 1.78. Discovered by Moriyama-san.
  Moriyama-san has also developed a test script that compares
  (en|de)coded results to the corresponding Win32 API result and all
  cp9?? maps are now verified.
  Message-Id: [EMAIL PROTECTED]

=head1 AUTHOR

Dan the Encode Maintainer
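
The bug in item 1 boils down to decoding a string that has already been decoded. As an illustration only (this is plain Perl, not the actual Encode.xs patch, and decode_utf8_once is a hypothetical helper name), the check can be sketched with utf8::is_utf8():

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode UTF-8 octets, but leave a string alone if
# perl already flags it as a character (utf8) string.
sub decode_utf8_once {
    my ($str) = @_;
    return $str if utf8::is_utf8($str);      # already decoded
    return Encode::decode( 'UTF-8', $str );
}

my $octets = "\xe5\xb0\x8f";                 # UTF-8 octets for U+5C0F
my $once   = decode_utf8_once($octets);
my $twice  = decode_utf8_once($once);        # second call is a no-op

printf "length after one pass: %d, after two: %d\n",
    length($once), length($twice);
```

The utf8 flag is an internal detail, so relying on it in application code is debatable; the sketch is only meant to show the shape of the condition described above.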
[Encode] HEADS-UP: ucm/cp932.ucm will be updated
Porters (especially Nick Ing-XS),

I would like to release Encode 1.78 soon to address the problem in CP932 (MS version of Shift_JIS) which MORIYAMA Masayuki [EMAIL PROTECTED] has discovered. Not only has he addressed the problem, he has also supplied me a patch. Though he was reluctant to come to perl(5-porters|unicode)@perl.org (I have invited him but he was too shy to talk to us in English), the problem and solution he has raised were too good to ignore, so I would like to update Encode on his behalf.

Here is the summary of his points.

* ucm/cp932.ucm was based on the mapping file at unicode.org [0] but that mapping is obsolete; it works on Windows 3.1 but not in the era of Win32.
* as a result, cp932 is rendered almost useless, or at least too impractical
* patch was made available [1]

My first suggestion was "Ask MS to update the data at unicode.org, and if you are unsatisfied w/ the one that comes w/ Encode you are free to CPANize your version". But he has raised even more points and I was finally convinced.

* Though not in unicode.org, MS has already made the mapping available on their web site [2][3]
* Python and Ruby will be using the MS version, not the one at unicode.org
* Java has been known to suffer badly for confusing Shift_JIS and CP932, but Encode is already free of this problem by supplying different mappings for Shift_JIS and CP932.

[0] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
[1] http://www2d.biglobe.ne.jp/~msyk/perl/cp932.html
[2] http://www.microsoft.com/typography/unicode/cscp.htm
[3] http://www.microsoft.com/typography/unicode/932.txt

One small but significant concern is Tcl/Tk; so far Encode's CP932 does match that of Tcl, but it will not after my next release of Encode. So I decided to call for opinions before I commit the release. AFAIK, CP\d+ should be avoided for any data exchanged on the Net, so you should not use it on the web or in mails; it's therefore perfectly all right if Tk(Web|Mail) has a problem handling them.

At the same time Win32 Perl users would be much happier if CP\d+ are made more practical.

The URI [2] also has links to other code pages, so I would also like to review them and, if necessary, update them. 8 bit code pages (CP12??) seem OK but other CJK ones (CP9??) need review.

Dan the Encode Maintainer
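
To see for yourself that Encode keeps Shift_JIS and CP932 as separate mappings, something like the sketch below can be used. The byte pair is just a sample input picked for this example and code_points is a hypothetical helper; no particular output is claimed here, since the result depends on the ucm files shipped with your Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode $bytes with encoding $enc and list the
# resulting code points as U+XXXX.
sub code_points {
    my ( $enc, $bytes ) = @_;
    my $chars = Encode::decode( $enc, $bytes );
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

my $bytes = "\x81\x60";    # sample double-byte sequence (an assumption)
for my $enc (qw(shiftjis cp932)) {
    printf "%-8s => %s\n", $enc, code_points( $enc, $bytes );
}
```

If the two lines printed differ, that is exactly the kind of discrepancy that bites when the two encodings are treated as one.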
Re: FW: ISO 8859-11 (Thai) cross-mapping table
On Tuesday, Oct 8, 2002, at 01:24 Asia/Tokyo, [EMAIL PROTECTED] wrote:
>> I'll fix it but withhold from $Encode::VERSION++ since the table itself appears correct.
>
> But now we have no TIS620, then, so that needs to be added?

Well, unless I hear requests from Thai native users, I'll abstain since TIS620 did not exist in http://www.unicode.org/Public/. So far as I see, ISO-8859-11 suffices. But once again I am only human, so correct me if I am wrong.

Dan the Man with Too Many Encodings to Support; Too Many Typos Generated
[Encode] 1.77 Released
Porters,

I am releasing Encode 1.77 to accommodate the up-and-coming changes to bleedperl that make tr/// free of eval qq{} under the use encoding pragma. This problem was addressed by me as;

    From: Dan Kogai [EMAIL PROTECTED]
    Date: Thu Oct 3, 2002 20:31:13 Asia/Tokyo
    To: [EMAIL PROTECTED]
    Cc: [EMAIL PROTECTED]
    Subject: tr/// and use encoding
    Message-Id: [EMAIL PROTECTED]

Whole Package -- to be uploaded to CPAN soon: http://www.dan.co.jp/~dankogai/Encode-1.77.tar.gz
Patch against bleedperl (242 lines) for Hugo: http://www.dan.co.jp/~dankogai/current-1.77.diff.gz

And here is the Changes. This one also includes a minor alias fix by Autrijus.

$Revision: 1.77 $ $Date: 2002/10/06 03:27:02 $
! t/jperl.t
  * Modified to accommodate the up-and-coming patch by Inaba-san
    that will fix tr/// needing eval qq{}
  Message-Id: [EMAIL PROTECTED]
! encoding.pm
  * pod fixes/enhancements to reflect the changes above
! lib/Encode/Alias.pm
  "Encode::TW is correct, Encode::Alias not." - /Autrijus/
  Message-Id: [EMAIL PROTECTED]

Note this update alone will NOT fix the tr/// vs. use encoding problem noted above; the real fix needs to be applied to bleedperl (regcomp.c and such). The patch was already made available by Inaba Hiroto [EMAIL PROTECTED] (IsP for short) and my preliminary tests already show that it works with a minor fix to ext/Encode/t/jperl.t. Hence the update to Encode prior to the bleedperl patch.

Here is the proposed schedule before IsP goes into bleedperl;

On Sunday, Oct 6, 2002, at 12:10 Asia/Tokyo, Jarkko Hietaniemi wrote:
>> Okay, how about the schedule below?
>>
>> 0) I release IsP-safe Encode 1.77 to CPAN. All I have to do is to comment out the local(${^ENCODE}) black magic so it is still test-safe under perl 5.8.0
>> 1) Hugo to sync bleedperl w/ Encode 1.77
>> 2) New *.t for IsP under somewhere OTHER THAN ext/Encode/t. lib/nihongo.t, maybe? that is to test thoroughly what IsP brings. That test suite does not have to run on stock perl 5.8.0
>> 3) With enough assurance w/ the test suite above, IsP goes into bleedperl.
>
> Sounds good to me.

The Encode update complies with step 0. Though this update is primarily for bleedperl, it is perl 5.8.0-safe. Please allow some time before steps 2-3. I would like Inaba-san's help on 2 and I am rather busy recently

Dan the Encode Maintainer
tr/// and use encoding
On Thursday, Oct 3, 2002, at 11:29 Asia/Tokyo, Jarkko Hietaniemi wrote:
> On Wed, Oct 02, 2002 at 10:44:06PM +0900, Dan Kogai wrote:
>> On Wednesday, Oct 2, 2002, at 22:34 Asia/Tokyo, Jarkko Hietaniemi wrote:
>> Both. I think the operation needed is straight-forward. When you get tr[LHS][RHS], decode'em then feed it to the naked tr// .
>
> Urk... That means a dip into the toke.c, how the tr/// ranges are implemented is... tricky. sv_recode_to_utf8() is needed somewhere... but I'm a little bit pressed for time right now. I suggest you perlbug this and move the process to perl5-porters. (Inaba Hiroto also might have insight on this; he's the tr///-with-Unicode sensei, really-- he practically implemented all of it. And he might read *[gk]ana much better than me :-)

So now this thread is in perl5-porters. Since this undocumented (lack of) feature has a very easy workaround, I am yet to perlbug this.

=head1 PROBLEM

C<use encoding 'foo-encoding'> nicely converts string literals and regexes into UTF-8 so you can get the power of perl 5.8.0 even when your source code is in a text encoding other than UTF-8. But tr/// does not embrace this magic.

=head1 WORKAROUND

Suppose your script is in EUC-JP and your source contains this:

    $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/;

And you want perl to do the following;

    $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

All you have to do is:

    use encoding 'euc-jp';
    # ...
    eval qq{ \$kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ };

=over

=item chars in this example

    utf8      euc-jp    charnames::viacode()
    -----------------------------------------
    \x{3041}  \xA4\xA1  HIRAGANA LETTER SMALL A
    \x{3093}  \xA4\xF3  HIRAGANA LETTER N
    \x{30a1}  \xA5\xA1  KATAKANA LETTER SMALL A
    \x{30f3}  \xA5\xF3  KATAKANA LETTER N

=back

=head1 DISCUSSION

I found this when I was writing a CGI book and I wanted a form validation/correction. The example above converts all Hiragana to Katakana, which is a common task in Japan. Traditionally this kind of operation was done via jcode::tr() (require "jcode.pl";) or Jcode::tr() (use Jcode;). But as of perl 5.6.0 you can apply Japanese directly to regexes and tr/// -- so long as your script is in UTF-8. With perl 5.8.0, the direct application of multibyte regexes was made possible via the C<use encoding> pragma.

The use encoding pragma applies its magic as follows. Suppose you C<use encoding 'foo';>

=over

=item 0.

${^ENCODING}, a special, non-scoped variable, is set to C<Encode::find_encoding('foo')>. If 'foo' is an encoding supported by Encode, ${^ENCODING} is now a transcoder object.

=item 1.

All string literals in q//, qq//, qw// and qr// (not sure of qx//) are first fed to ${^ENCODING}->decode(). So from perl's point of view, it's the same as literals written in UTF-8.

=item 2.

C<binmode STDIN, ":encoding(foo)";> and C<binmode STDOUT, ":encoding(foo)";> are implicitly applied, so you can feed STDIN in encoding 'foo' and get STDOUT in encoding 'foo'.

=back

Very clever and powerful. But 1. is not done to tr///. qq{} is under control of C<use encoding>, so eval qq{} works as expected.

Though the workaround is simple, easy and clever, it still leaves an inconsistency in how ${^ENCODING} gets used; it does indeed work on non-interpolated literals already.

=head1 REPORTED BY

Dan the Encode Maintainer E<lt>[EMAIL PROTECTED]E<gt>
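
The effect the workaround is after can also be had by spelling the ranges as code points, which sidesteps the source-encoding question altogether. A minimal sketch, assuming the ranges from the table above (hira2kata is a hypothetical helper name, not part of Encode or Jcode):

```perl
use strict;
use warnings;

# Hypothetical helper: map HIRAGANA LETTER SMALL A .. HIRAGANA LETTER N
# onto KATAKANA LETTER SMALL A .. KATAKANA LETTER N.
sub hira2kata {
    my ($str) = @_;
    $str =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/;
    return $str;
}

my $kana = hira2kata("\x{3042}\x{3093}");    # hiragana A, hiragana N
printf "U+%04X U+%04X\n", map { ord } split //, $kana;
```

This is of course less readable than writing the kana literally, which is the whole point of wanting tr/// to honour use encoding.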
[FYI] use encoding 'non-utf8-encoding'; use CGI;
I am currently writing yet another CGI book. It is for the Japanese market and written in Japanese, so it is inevitable that you have to face the labyrinth of character encoding. Before perl 5.8.0, most books that teach how to handle Japanese in CGI go as follows;

* stick with EUC-JP. it does not poison perl like Shift_JIS.
* use jcode.pl or Jcode.pm when you have to convert encoding.
* you can use jcode::tr or Jcode->tr when you have to convert between Hiragana and Katakana

fine, so far. But

* totally forget regex unless you are happy with a very counter-intuitive measure illustrated in 6.18 of the Cookbook
* if you are desperate for Kanji regex, use jperl instead.

That has now changed with 'use encoding'. But when it comes to CGI, 'use encoding' alone will not cut it. But CGI.pm can handle multipart/form-data . Together you can use regex safely and intuitively without resorting to converting your CGI script to UTF-8.

The 120-line script right after my signature illustrates that. Sorry, it contains some Japanese (or my point gets blurred).

As you see, tr/// is not subject to the magic of 'use encoding'. jhi, have we made it so deliberately? I am beginning to think tr/// would be happier to embrace the power thereof. Still, it can be overcome by a simple eval qq{} as illustrated. This much idiom would not hurt much, at least not as much as the Cookbook sample

Dan the Transcoded Man

    #!/usr/local/bin/perl
    #
    # Save me in EUC-JP!
    use 5.008;
    use strict;
    use CGI;
    use CGI::Carp qw(fatalsToBrowser);
    our $Method  = 'POST';
    #our $Method = 'GET';
    our $Enctype  = 'multipart/form-data';
    #our $Enctype = 'application/x-www-form-urlencoded';
    our $Charset = 'euc-jp';
    use encoding 'euc-jp';

    my $cgi = CGI->new();
    my %Label = (
        name    => '名前',
        kana    => 'フリガナ',
        mailto  => '電子メール',
        mailto2 => '電子メール(確認)',
        tel     => '電話',
        fax     => 'ファックス',
        zip     => '〒',
        address => '住所',
        comment => 'ご意見',
    );

    unless ($cgi->param()){
        print_input($cgi);
    }else{
        my $kana = $cgi->param('kana');
        $kana =~ s/[\s　]+//g; # beware of zenkaku space!
        eval qq{ \$kana =~ tr/ぁ-ん/ァ-ン/ };
        # $kana =~ tr/ぁ-ん/ァ-ン/; # will not work but do you know why?
        $cgi->param(kana => $kana);
        print_output($cgi);
    }

    sub print_input{
        my $c = shift;
        print_html(
            $c,
            title   => "Form:入力",
            name    => $c->textfield(-name => 'name'),
            kana    => $c->textfield(-name => 'kana'),
            mailto  => $c->textfield(-name => 'mailto'),
            mailto2 => $c->textfield(-name => 'mailto2'),
            tel     => $c->textfield(-name => 'tel'),
            fax     => $c->textfield(-name => 'fax'),
            zip     => $c->textfield(-name => 'zip'),
            address => $c->textfield(-name => 'address'),
            comment => $c->textarea(-name => 'comment'),
        );
    }

    sub print_output{
        my $c = shift;
        print_html(
            $c,
            title   => "Form:出力",
            name    => $c->param('name'),
            kana    => $c->param('kana'),
            mailto  => $c->param('mailto'),
            mailto2 => $c->param('mailto2'),
            tel     => $c->param('tel'),
            fax     => $c->param('fax'),
            zip     => $c->param('zip'),
            address => $c->param('address'),
            comment => $c->param('comment'),
        );
    };

    sub print_html{
        my $c = shift;
        my %arg = @_;
        print
            $c->header(-charset => $Charset),
            $c->start_html(-title => $arg{title}),
            $c->h1($arg{title});
        # the form wrapper is only needed on the input page
        $c->param() or
            print $c->start_form(-method => $Method, -enctype => $Enctype);
        print
            $c->start_table({border => 1}),
            $c->Tr([
                $c->td([ $Label{name}    => $arg{name} ]),
                $c->td([ $Label{kana}    => $arg{kana} ]),
                $c->td([ $Label{mailto}  => $arg{mailto} ]),
                $c->td([ $Label{mailto2} => $arg{mailto2} ]),
                $c->td([ $Label{tel}     => $arg{tel} ]),
                $c->td([ $Label{fax}     => $arg{fax} ]),
                $c->td([ $Label{zip}     => $arg{zip} ]),
                $c->td([ $Label{address} => $arg{address} ]),
                $c->td([ $Label{comment} => $arg{comment} ]),
            ]);
        if ($c->param()){
            # SCRIPT_NAME is the standard CGI variable for this script's URL path
            print $c->td($c->a({href => $ENV{SCRIPT_NAME}}, "Retry"));
        }else{
            print $c->td([$c->reset(), $c->submit()]),
        };
        print $c->end_form() unless $c->param();
        print $c->end_table(), $c->end_html();
    }
    __END__
Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;
On Wednesday, Oct 2, 2002, at 22:15 Asia/Tokyo, Jarkko Hietaniemi wrote:
> (Hi, it's me again...) Are you doing character ranges in the tr/// under 'use encoding'? (I'm asking because I see a - in the middle of what I assume is mangled EUC-JP)

Yes. That's where hiragana -> katakana conversion is attempted; the English equivalent is tr/A-Z/a-z/.

Dan
Re: [FYI] use encoding 'non-utf8-encoding'; use CGI;
On Wednesday, Oct 2, 2002, at 21:51 Asia/Tokyo, Jarkko Hietaniemi wrote:
> However, I will need to stare at your example some more, since for simpler cases I think tr/// *is* obeying the 'use encoding':
>
>     use encoding 'greek';
>     ($a = "\x{3af}bc\x{3af}de") =~ tr/\xdf/a/;
>     print $a, "\n";
>
> This does print "abcade\n", and it also works when I replace the "\xdf" with the literal \xdf.

I can explain that. "\x{3af}bc\x{3af}de" is a string literal so it gets encoded. However, my example in escaped form is;

    $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/

which does not get encoded. The intention was;

    $kana =~ tr/\x{3041}-\x{3093}/\x{30a1}-\x{30f3}/

That's why

    eval qq{ $kana =~ tr/\xA4\xA1-\xA4\xF3/\xA5\xA1-\xA5\xF3/ }

works, because \xA4\xA1-\xA4\xF3 and \xA5\xA1-\xA5\xF3 are converted to \x{3041}-\x{3093} and \x{30a1}-\x{30f3}, respectively.

Dan
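
The conversion claimed in the last sentence can be checked outside of any pragma by decoding the EUC-JP octets explicitly. A small sketch (euc_jp_to_escapes is a hypothetical helper written for this example):

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode EUC-JP octets and print each resulting
# character in \x{....} notation.
sub euc_jp_to_escapes {
    my ($octets) = @_;
    my $chars = Encode::decode( 'euc-jp', $octets );
    return join '', map { sprintf '\x{%04x}', ord } split //, $chars;
}

# The two tr/// range endpoints on each side.
print euc_jp_to_escapes("\xA4\xA1"), '-', euc_jp_to_escapes("\xA4\xF3"), "\n";
print euc_jp_to_escapes("\xA5\xA1"), '-', euc_jp_to_escapes("\xA5\xF3"), "\n";
```

The two lines printed are the ranges that tr/// ends up seeing once qq{} has been through the decoder.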
Re: Encode functionality for Perl 5.6.1
On Saturday, Sep 21, 2002, at 22:38 Asia/Tokyo, Robert Allerstorfer wrote:
> Hi, the great Encode module requires perl 5.8. Are there any backports existing yet that may work with 5.6.1? I am trying to find a solution to encode Japanese (shiftjis) and Chinese (gb2312 and big5) into utf8 with that perl version since 5.8 is not yet used widely, unfortunately.

Okay, let me repeat what I have said in this mailing list before.

0) Backporting Encode to 5.6.1 and perhaps 5.00503 was my first intention when I joined (and later took over) the development thereof
1) Then I found the Unicode stuff in 5.6.1 is very kaputt. At the same time Encode was made very perl-5.8.0 dependent, especially in Unicode handling
2) So I concluded I would rather advocate perl 5.8 than put effort into backporting Encode.
3) Efforts by others to backport are welcome, provided
   a) if it uses 'Encode' as module name it needs to work both in 5.8 and 5.6.1. Bottom line is that the backported version will not breach what it is now. If it ain't broke, don't fix it (and 5.6.1 was broke Unicode-wise)
   b) if you just implemented Encode functionality in perl 5.6.1 but incompatible w/ 5.8, give it a different name; e.g. Encode::Compat
   c) at any rate don't forget to share your idea and work here at [EMAIL PROTECTED]

Dan the Encode Maintainer
[Encode] enc2xs fixes
On Sunday, Sep 1, 2002, at 18:10 Asia/Tokyo, Andreas J. Koenig wrote:
> Apparently I'm missing something. 'make manifest' is only good for the expert, because you need to know which lines you have to delete from MANIFEST after running 'make manifest'. That's not obvious. If you know a trivial way to write a correct MANIFEST file, I'd be grateful if you could add it. Maybe then some of my documentation changes are not needed.

Right. MANIFEST generation is not as trivial as it looks and 'make manifest' does it trivially -- too trivially

[snip]

>> But I am not so happy if enc2xs becomes behemoth like h2xs. Man, I thought I had soothed all the feeping creaturism before the release of 5.8.0
>
> Please see if the patch below makes it a behemoth. It does just the following:
>
>   Add Usage().
>   Add Version().
>   Add -h and -v.
>   Die on an insufficient commandline.
>   Removed ARGV from the arguments in the call to make_configlocal_pm().
>   Document what find_e2x does in a comment.
>   Speed up find_e2x considerably.
>   Tweak documentation.

Thank you. Now your patch is in my repository. But since we are not in urgency, I would like to wait for NI-XS for his tweaks/fix for the 'use encoding utf8' workaround.

Besides, I am a little busy taming Jaguar right now...

Dan the Encode Maintainer.
Re: Encode 1.76 Released
On Friday, August 30, 2002, at 08:48 , Andreas J. Koenig wrote:
> Hi Dan, today I revisited enc2xs and found three things missing:

Okay

> - enc2xs doesn't write a MANIFEST file: this would be handy as the innocent user doesn't know which files need to be included in a distribution

I reckon your suggestion is to 'let enc2xs generate any missing files that are enough to CPANize the encoding'. Is there any other file missing? MANIFEST autogeneration is trivial but I still prefer 'make manifest'

> - no -h or --help option

I'm not sure enc2xs will be used frequently enough to call for -h but there is no reason not to add one. Maybe detailed info on -M and a very, very brief description of -o and such that are not supposed to be invoked by a human (even I don't do that except for debugging).

> - no -v or --version option

This should return the version of Encode.pm as well as enc2xs itself.

> I'd volunteer to add all that if you'd be inclined to accept (and proofread) it. Please let me know what you think.

I am glad you help me out (well, your trust level is so high that you can do all the work and all I do is put your new version to my repository -- and claim the credit (c) ams).

But I am not so happy if enc2xs becomes behemoth like h2xs. Man, I thought I had soothed all the feeping creaturism before the release of 5.8.0

Dan the Patient of Feeping Creaturism
Re: translating the Perl 5.8.0 announcement to CJK
On Friday, July 19, 2002, at 04:27 AM, Jarkko Hietaniemi wrote: Final round of proofreadings, if I may: http://www.iki.fi/jhi/pl580.txt.big5.tw http://www.iki.fi/jhi/pl580.txt.euc.cn http://www.iki.fi/jhi/pl580.txt.euc.jp http://www.iki.fi/jhi/pl580.txt.euc.kr Looks like miyagawa-kun's patch is already in for .jp. Looks good. Be it the final version. Dan
Re: translating the Perl 5.8.0 announcement to CJK
jhi and any that grok Nihongo, On Wednesday, July 17, 2002, at 02:41 AM, Jarkko Hietaniemi wrote: Neat. I now have the Chinese and Korean ones online: http://www.iki.fi/jhi/pl580.txt.cn http://www.iki.fi/jhi/pl580.txt.tw http://www.iki.fi/jhi/pl580.txt.kr Here is the Japanese version at last. http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp I decided to post this to [EMAIL PROTECTED] because I have fever and my brain runs on single-digit percentage point of its capacity so I am not as sure of the quality of document as usual. If you are able to grok Japanese please feel free to polish the doc. Yoroshiku Onegaishimasu! Dan the Sneezing Translator of Yours
Re: translating the Perl 5.8.0 announcement to CJK
On Thursday, July 18, 2002, at 02:31 AM, Jarkko Hietaniemi wrote: Notice the name changes. I also edited away the DJGPP broken entry since that's now fixed. I notice that the VMS section in the .jp one is untranslated, and the .kr is still missing the new Unicode section. http://www.iki.fi/jhi/pl580.txt.big5.tw http://www.iki.fi/jhi/pl580.txt.euc.cn http://www.iki.fi/jhi/pl580.txt.euc.jp http://www.iki.fi/jhi/pl580.txt.euc.kr Thus fixed. I've also corrected linefeeds so it looks better via web browsers. Get the newer version one via http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp Dan the Translator of Yours
Re: translating the Perl 5.8.0 announcement to CJK
On Thursday, July 18, 2002, at 01:34 PM, Autrijus Tang wrote: I'm extremely sorry to bother you again with nitpicking, but that version did not contain the TraditionalSimplified Chinese fix I posted earlier. http://www.autrijus.org/tmp/pl580.txt.euc.jp http://www.autrijus.org/tmp/pl580.txt.utf8.jp Has it corrected. now http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.euc.jp http://www.dan.co.jp/~dankogai/bleedperl/pl580.txt.utf8.jp are identical to the ones above (I've lwp-downloaded'em :). Xiexie and Kiitos. Dan the Corrected Man
[Encode] 1.75 Released
On Sunday, June 2, 2002, at 02:24 AM, Jarkko Hietaniemi wrote:
> I say GRRR. Be quick about it. I'm already wrapping things up, but I guess a few aliases won't break this camel's back too much.

"Ignorance is Bliss", the phrase hit me when I released Encode-1.75, available as follows;

Whole: http://www.dan.co.jp/~dankogai/Encode-1.75.tar.gz and CPAN
Diff against current: (191 lines) http://www.dan.co.jp/~dankogai/current-1.75.diff.gz

And Changes

$Revision: 1.75 $ $Date: 2002/06/01 18:07:49 $
! lib/Encode/Alias.pm t/Alias.t lib/Encode/Supported.pod TW/TW.pm
  glibc compliance cited by Autrijus.
  http://www.li18nux.org/docs/html/CodesetAliasTable-V10.html
! bin/enc2xs bin/piconv
  Subject: Re: forewarning: usedevel and versiononly
  Message-Id: [EMAIL PROTECTED]

Autrijus, please for pumpkin's sake don't find anything too irresistible to leave it unfixed :-P

Dan the Encode Maintainer
Re: ICU and Parrot
On Saturday, June 1, 2002, at 12:34 AM, Autrijus Tang wrote:
> On Fri, May 31, 2002 at 06:18:55AM +0900, Dan Kogai wrote:
>> As a matter of fact GB18030 is ALREADY supported via Encode::HanExtra by Autrijus Tang. The only reason GB18030 was not included in Encode main is sheer size of the map.
>
> Yes, partly because it was not implemented algorithmically. :)
>
> I was browsing http://www-124.ibm.com/cvs/icu/charset/data/ucm/ and toying with uconv, and wondered:
>
> 1) Does Encode have (or intend to have) them all covered?

No, unless they appear in www.unicode.org. Though some of them are actually adopted. Useful it may be I found raw ICM too Big and too Blue :)

> 2) If not, would a Encode::ICU be wise?

I'm not so sure. But if I were the one to implement Encode::ICU, it would not be just a compiled collection of UCM files but a wrapper to all library functions that ICU has to offer. I, for one, am too lazy for that.

> 3) A number of encodings are in HanExtra but not their ucm repository, namely big5plus, big5ext and cccii. Is it wise to feed back to them under the name of e.g. perl-big5plus.ucm?

You should in time and I should, too, because I have expanded UCM a little so that you can define combined characters commonly seen in Mac*. But I don't see any reason to be in a hurry for the time being.

If any of you are a member of team ICU you may redirect this dialogue to your team so we can work together in future (after 5.8.0, that is).

Dan the Encode Maintainer
Re: [PATCH] Encode::MIME::Header
On Monday, May 20, 2002, at 11:39 AM, Tatsuhiko Miyagawa wrote:
> charsets can include _ in its name. Here's a patch.

Thanks, applied. With patches from Autrijus and you I think I now have enough diff to justify the version increment of Encode. Next version within 24 hours.

Oh, VMS is still on the to-do list...

Dan the Encode Maintainer

--
Tatsuhiko Miyagawa [EMAIL PROTECTED]

    --- Header.pm~  Sun May 5 01:41:30 2002
    +++ Header.pm   Mon May 20 11:34:39 2002
    @@ -51,7 +51,7 @@
         $str =~
             s{
               =\?                  # begin encoded word
    -          ([0-9A-Za-z\-]+)     # charset (encoding)
    +          ([0-9A-Za-z\-_]+)    # charset (encoding)
               \?([QqBb])\?         # delimiter
               (.*?)                # Base64-encodede contents
               \?=                  # end encoded word
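
To see what the one-character change buys, here is a stand-alone sketch that applies the old and the new charset pattern to an encoded word whose charset contains an underscore. It only reuses the regex fragment from the patch; the helper names and the sample header are made up for this example and are not part of Encode::MIME::Header.

```perl
use strict;
use warnings;

# Hypothetical helpers: return the charset of a MIME encoded word, or
# undef if the word does not match.
sub charset_old {    # character class before the patch
    my ($word) = @_;
    return $word =~ m{ =\? ([0-9A-Za-z\-]+)  \? ([QqBb]) \? (.*?) \?= }x ? $1 : undef;
}

sub charset_new {    # character class after the patch (adds "_")
    my ($word) = @_;
    return $word =~ m{ =\? ([0-9A-Za-z\-_]+) \? ([QqBb]) \? (.*?) \?= }x ? $1 : undef;
}

my $word = '=?ISO_8859-1?Q?abc?=';    # sample input, an assumption
for my $f ( \&charset_old, \&charset_new ) {
    my $cs = $f->($word);
    print defined $cs ? "matched charset $cs\n" : "no match\n";
}
```

With the old class the whole encoded word is silently left undecoded, which is why the bug was easy to miss.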
Re: Acceptance of Unicode (UTF8) in Far East
On Thursday, May 16, 2002, at 03:04 AM, Mark Lewellen wrote:
> Hi all-
>
> I have a question directed mostly at those involved in the Far East. Since Unicode is often implemented in UTF8, and UTF8 uses 3 bytes for Chinese characters (instead of the 2 bytes in Chinese and Japanese GB, Big5, JIS), UTF8 documents solely in these languages will be 50% larger. This appears to be a large stumbling block to universal acceptance of UTF8. Is there much resistance to UTF8 in the Far East, are there work-arounds to the problem, and are many people even aware of the problem?
>
> Mark

Size of data is not a big deal these days with data compression and faster networks. So far as I see there are very few who dislike UTF-8 because of the size bloat. Most of the objections and dislikes against Unicode are more about politics and culture.

Whether you like it or not, Unicodization is steady because it is already blessed by Windows and MacOS (X). And you have virtually no choice but to use Unicode when you program in Java.

But the Unicodization of applications has only begun. UTF-8 mails and web pages are still rare, mainly because of lack of tools (well, as a matter of fact many of these tools do support Unicode but simply don't make UTF-8 the default when they send or save data).

And even if tools are there it may still take a long time before data get converted to UTF-8. Unless you need to save more than 3 languages, legacy encodings do suffice, and many may still choose to save new data in legacy encodings for legacy applications.

To me it is okay whichever encoding you choose to save your data in, so long as I can read it. That's why I became a maintainer of the Encode module, a standard part of Perl 5.8 that enables you to do so.

Dan the Encode Maintainer
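
For the record, the "50% larger" figure is easy to reproduce with Encode itself. A minimal sketch, assuming a three-kanji sample string (the characters from my signature) and a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: number of octets $text occupies in encoding $enc.
sub byte_length {
    my ( $enc, $text ) = @_;
    return length( Encode::encode( $enc, $text ) );
}

my $text = "\x{5c0f}\x{98fc}\x{5f3e}";    # three kanji, all in JIS X 0208
for my $enc (qw(euc-jp UTF-8)) {
    printf "%-6s %d bytes\n", $enc, byte_length( $enc, $text );
}
```

Three kanji take 6 octets in EUC-JP and 9 in UTF-8, which is the 50% Mark mentions; any ASCII markup mixed into the text is the same size in both and narrows the gap.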
Re: use encoding in both scripts and modules
On Monday, May 6, 2002, at 05:16 , Tatsuhiko Miyagawa wrote: panic happens while hacking with encoding pragma. It seems use encoding is still in effect after you 'use EncBar'. Simply commenting out 'use encoding 'euc-jp'' in encoding-test.pl makes the program work as expected. Dan
Re: [PATCH] Encode::Encoding
On Monday, May 6, 2002, at 06:51 , Tatsuhiko Miyagawa wrote:
>     package Encode::MyEncoding;
>     use base qw(Encode::Encoding);
>     __PACKAGE__->Define(qw(myCanonical myAlias));
>
> dies saying:
>
>     Error: Undefined subroutine Encode::define_encoding called at ...
>
> Patch follows after sig.

Thanx. Applied.

Dan the Encode Maintainer
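
For readers who have not written one, a complete (if silly) Encode::Encoding subclass looks roughly like the sketch below. Everything specific in it is an assumption made for illustration: the package name, the encoding names and the ROT13 "encoding" itself. Loading Encode explicitly before Define() is a belt-and-braces measure against the very error reported above.

```perl
package Encode::MyRot13;    # hypothetical package name
use strict;
use warnings;
use Encode ();              # make sure Encode::define_encoding exists
use parent qw(Encode::Encoding);

# Canonical name first, then any aliases (both names are made up).
__PACKAGE__->Define(qw(x-my-rot13 x-my-rot13-alias));

# ROT13 is its own inverse, so encode and decode share one routine.
sub _rot13 {
    my ($str) = @_;
    ( my $out = $str ) =~ tr/A-Za-z/N-ZA-Mn-za-m/;
    return $out;
}

sub encode {
    my ( $self, $str, $chk ) = @_;
    return _rot13($str);
}

sub decode {
    my ( $self, $octets, $chk ) = @_;
    return _rot13($octets);
}

package main;

print Encode::encode( 'x-my-rot13', 'Hello' ), "\n";
print Encode::decode( 'x-my-rot13', 'Uryyb' ), "\n";
```

A real encoding would also honour the $chk argument (fallback behaviour); that is left out here to keep the skeleton short.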
Re: [preannounce] Encode::Punycode
On Monday, May 6, 2002, at 07:11 , Tatsuhiko Miyagawa wrote: I've just made Encode implementation for Punycode[1]. (Does it make any sense to make such an encodings as subclass of Encode::Encoding? I think it's reasonable, as there's Encode::MIME::Header!) I bet you do that sooner or later. Thanks! As for module hierarchy, your choice is perfectly valid and it is even NI-XS recommendation. Dan the Encode Maintainer
[Encode] 1.69 Released
I hope it was not too premature to release Encode-1.69, now available as follows.

Whole: http://www.dan.co.jp/~dankogai/Encode-1.69.tar.gz and CPAN
Diff against current: 180 lines http://www.dan.co.jp/~dankogai/current-1.69.diff.gz

And here are Changes

$Revision: 1.69 $ $Date: 2002/05/04 16:41:18 $
! lib/Encode/MIME/Header.pm
  Floating-point coerced for UNICOS (in integer arithmetics it folds
  line one character too early). Verification by Mark is pending.
  Message-Id: [EMAIL PROTECTED]
! Unicode/Unicode.pm
  more doc patch from Elizabeth
  Message-Id: [EMAIL PROTECTED]
! Encode/Makefile_PL.e2x
  More platform-independent patch from Benjamin
  Message-Id: [EMAIL PROTECTED]
! lib/Encode/Guess AUTHORS
  split regex fix by Graham Barr. Adds him to AUTHORS.
  Message-Id: [EMAIL PROTECTED]
! Encode/Makefile_PL.e2x
  enc2xs script discovery made smarter and more sensible, first
  cited by Miyagawa-kun and further suggestions by Rafael and Andreas
! Encode.pm lib/Encode/Guess.pm t/fallback.t t/guess.t t/mime-header.t
  "The EBCDIC remapping of the low 256 bytes again"
  #16372 by jhi

UNICOS needs verification by Mark or others. I am fairly sure of the cause of failure and equally sure of the fix. But I am not sure if UNICOS groks what I mean...

Dan the Encode Maintainer

P.S. I think we are very, very close to 5.8.0-RC1 but will we make it on May 8th so we can claim we are only a month behind?
[Encode] 1.68 Released
I am delighted to add the first female to AUTHORS when I released Encode 1.68, available as follows;

Whole: http://www.dan.co.jp/~dankogai/Encode-1.68.tar.gz
Diff against current: 106 lines http://www.dan.co.jp/~dankogai/current-1.68.diff.gz

Changes is just one paragraph long.

$Revision: 1.68 $ $Date: 2002/05/03 12:20:13 $
! lib/Encode/Alias.pm lib/Encode/Supported.pod t/Alias.t AUTHORS
  UCS-4 added to aliases of UTF-32 by Elizabeth Mattijsen.
  Alias.t and Supported.pod modified to reflect the change.
  Elizabeth added to AUTHORS. And H.M. is also added for
  forwarding her patch among other contributions (I was rather
  surprised to find his name was not there yet!)
  Message-Id: [EMAIL PROTECTED]

If there is one kind of diversity that is lacking in Perl, it is definitely sex ratio. In terms of the sheer number of sexes it is already more diverse than the ordinary world, for Perl mongers have female, male, and the Borg :P

Dan the Encode Maintainer / the Equal Opportunity Whippee
[Encode] 1.67 released
I wonder how Laszlo is doing when I released Encode 1.67, available as follows.

Whole: http://www.dan.co.jp/~dankogai/Encode-1.67.tar.gz and CPAN
Diff against current: 147 lines
  http://www.dan.co.jp/~dankogai/current-1.67.diff.gz

And Changes. As you see, the changes are just cosmetic.

$Revision: 1.67 $ $Date: 2002/05/02 07:33:09 $
! Encode.xs
  Error message now consistent w/ perlqq (\N{U+} -> \x{}); done in
  perl@16308 but Philip linted me further. Now the error messages are
  macronized as ERR_ENCODE_NOMAP and ERR_DECODE_NOMAP.
! lib/Encode/Guess.pm
  Sanity check for happier -w by Autrijus

Dan the Encode Maintainer
Encode-InCharset-0.01 Released
I have just released Encode-InCharset-0.01, available as http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.

I have developed this module primarily to implement ISO-2022-JP-3 and ISO-2022-CN in the future. To implement encode() in these, you have to know which character set a given character belongs to. But this module can also be used to check whether a string can safely be encoded (though fallback is much faster).

Dan the Encode Maintainer

NAME
    Encode::InCharset - defines \p{InCharset}

INSTALL
    perl Makefile.PL
    make test
    make install

SYNOPSIS
    use Encode::InCharset qw(InJIS0208);
    "I am \x{5c0f}\x{98fc}\x{3000}\x{5f3e}" =~ /(\p{InJIS0208})+/o;
    # guess what is in $1

ABSTRACT
    This module provides In*Charset* Unicode properties that match the characters in *Charset*. As of this writing, the property-matching functions are auto-generated out of the ucm files in Encode, Encode::HanExtra, and Encode::JIS2K.

DESCRIPTION
    As of this writing, this module supports the character properties shown below. Since the names are self-explanatory I am not going to discuss them in detail.

InASCII InAdobeStandardEncoding InAdobeSymbol InAdobeZdingbat InBIG5EXT InBIG5PLUS InBIG5_ETEN InBIG5_HKSCS InCCCII InCP1006 InCP1026 InCP1047 InCP1250 InCP1251 InCP1252 InCP1253 InCP1254 InCP1255 InCP1256 InCP1257 InCP1258 InCP37 InCP424 InCP437 InCP500 InCP737 InCP775 InCP850 InCP852 InCP855 InCP856 InCP857 InCP860 InCP861 InCP862 InCP863 InCP864 InCP865 InCP866 InCP869 InCP874 InCP875 InCP932 InCP936 InCP949 InCP950 InDingbats InEUC_CN InEUC_JISX0213 InEUC_JP InEUC_KR InEUC_TW InGB12345 InGB18030 InGB2312 InGSM0338 InHp_Roman8 InISO_8859_1 InISO_8859_10 InISO_8859_11 InISO_8859_13 InISO_8859_14 InISO_8859_15 InISO_8859_16 InISO_8859_2 InISO_8859_3 InISO_8859_4 InISO_8859_5 InISO_8859_6 InISO_8859_7 InISO_8859_8 InISO_8859_9 InISO_IR_165 InJIS0201 InJIS0208 InJIS0212 InJIS0213_1 InJIS0213_2 InJohab InKOI8_F InKOI8_R InKOI8_U InKSC5601 InMacArabic InMacCentralEurRoman InMacChineseSimp InMacChineseTrad InMacCroatian InMacCyrillic InMacDingbats InMacFarsi InMacGreek InMacHebrew InMacIcelandic InMacJapanese InMacKorean InMacRoman InMacRomanian InMacRumanian InMacSami InMacSymbol InMacThai InMacTurkish InMacUkrainian InNextstep InPOSIX_BC InShift_JIS InShift_JISX0213 InSymbol InVISCII EXPORT # will import all of them use Encode::InCharset; # will import only properties in qw() use Encode::InCharset qw(InCharset...) SEE ALSO the Encode manpage, the perlunicode manpage AUTHOR Dan Kogai [EMAIL PROTECTED] COPYRIGHT AND LICENSE Copyright 2002 by Dan Kogai This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://www.perl.com/perl/misc/Artistic.html
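
For readers unfamiliar with how an In*Charset* property can exist at all, here is a minimal sketch of Perl's user-defined property mechanism. It does not use Encode::InCharset and is not taken from it; the property name InKanaSketch, the two ranges, and the helper first_kana_run are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical user-defined property (not part of Encode::InCharset).
# The sub name must begin with "In" or "Is"; it returns one range per
# line as "start<TAB>end" in hex.
sub InKanaSketch {
    return join "\n",
        "3040\t309F",    # Hiragana block
        "30A0\t30FF",    # Katakana block
        "";
}

# Return the first run of characters that belong to the property,
# or undef when there is none.
sub first_kana_run {
    my ($text) = @_;
    return $text =~ /(\p{InKanaSketch}+)/ ? $1 : undef;
}

my $run = first_kana_run("abc\x{30C0}\x{30F3}xyz");
printf "matched %d character(s)\n", length $run if defined $run;
```

A generated property such as InJIS0208 would presumably return a much longer list of ranges in the same format, derived from the ucm file.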
Re: [PATCH] Let Guess.pm handles uninitialized argument.
On Wednesday, May 1, 2002, at 09:19 , Autrijus Tang wrote: This way is self-descriptory; it makes -w happier. :) /Autrijus/ XieXie. Applied. Dan the Encode Maintainer
Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 10:30 , Jarkko Hietaniemi wrote: Thanks, upgraded. A bit of noise from ext/PerlIO/t/fallback.t: ./perl -Ilib ext/PerlIO/t/fallback.t 1..8 ok 1 - opened iso-8859-1 file \N{U+20ac} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21. ok 2 - perlqq escapes ok 3 - opened iso-8859-1 file ok 4 - HTML escapes ok 5 - Opened as ASCII # 5c ok 6 - Escaped non-mapped char ok 7 - Opened as ASCII # fffd ok 8 - Unicode replacement char Also, is it intentional that there is no \N{U+} syntax...? That was planned at some point but as of there is no such thing Okay, I'll change the error message in the next one so it would say \x{abcd} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21. Autrijus just sent me a patch so it won't take long. ./perl -Ilib -Ilib -Mcharnames=:full -e '\N{U+20ac}' Unknown charname 'U+20ac' at lib/unicore/Name.pl line 1 Why not just use \x{...}? If that's PERLQQ, that's what I would expect? Speaking of charnames and utf8heavy, charname::viacode() is incredibly slow (I tried to use it extensively to pretty-comment ucm files. I gave up and used quicker and dirtier approach originally by NI-XS) and I don't really like how unicore/ is laid out. We can at least make use of AnyDBM_File (the key-value pairs needed there is totally SDBM_File safe so we can safely use it!) or if we can spend more memory, Storable. return 'END' 0 END is totally counterintuitive and the whitespace in between must be exactly a single '\t' and that sucks (I've been annoyed why my test script on InMyOwnDefinition didn't work as expected). I would like to make this a 5.8.1 todo of mine. Dan the Encode Maintainer
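
As a small illustration of the two things discussed here, the sketch below calls charnames::viacode() for a single code point and formats a character in the \x{...} style that the revised error message uses. The helpers name_of and perlqq are hypothetical names, not Encode or charnames API, and nothing here measures the slowness complained about above.

```perl
use strict;
use warnings;
use charnames ':full';

# Look up the Unicode name of a code point via charnames::viacode().
sub name_of {
    my ($codepoint) = @_;
    return charnames::viacode($codepoint);
}

# Format a character in perlqq style: \x{hhhh}.
sub perlqq {
    my ($char) = @_;
    return sprintf '\x{%04x}', ord $char;
}

my $euro = "\N{EURO SIGN}";
printf "%s is %s\n", perlqq($euro), name_of(ord $euro);
```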
Re: Encode, charnames and utf8heavy
On Wednesday, May 1, 2002, at 10:57 , Dan Kogai wrote: Okay, I'll change the error message in the next one so it would say \x{abcd} does not map to iso-8859-1 at ext/PerlIO/t/fallback.t line 21. Autrijus just sent me a patch so it won't take long. Done in my repository. Was piconv5.7.3 -c -f utf8 -t ascii t/jisx0201.utf \N{U+ff61} does not map to ascii, 134 at /home/dankogai/lib/perl5/5.7.3/i386-freebsd/Encode.pm line 175, line 1. Is bleedperl -Mblib `which piconv5.7.3` -c -f utf8 -t ascii t/jisx0201.utf \x{ff61} does not map to ascii at /usr/home/dankogai/work/Encode/blib/lib/Encode.pm line 175, line 1. Is there anything I should fix before Encode 1.67 ? (ahem, besides djgpp which I am still waiting for the news from Laszlo) Dan the Encode Maintainer
Re: [Encode] euc-jp vs euc-jisx0213
On Monday, April 29, 2002, at 07:38 , SADAHIRO Tomoyuki wrote: I doubt whether users of 'euc-jp' will assume it to be a combination with JIS X 0213. They don't have to because 'euc-jp' behaves exactly the same as before so long as the charset is in ASCII/JISX(0201|0208|0212). Such a mixing would prevent warning/croaking for appearance of code points that are not defined originally (meaning w/o X 0213), wouldn't it? That was my biggest concern but I have decided to go ahead with euc-jp to (partially) support JIS X 0213 and the reason is simple; Encode::JP is already too big to differentiate between various euc-jp. In such cases, we should settle for the most 'comprehensive' version. Even the term 'euc-jp' is too ambiguous for many; At first it didn't include G3 and some say they must be clearly marked as something like 'euc-jp-classic' (no 0212 support) vs 'euc-jp-modern' and so forth (then our current euc-jp should be marked as 'euc-jp-postmodern' :). It would be nice if we can go that way like 7bit-JIS/ISO-2022-JP/ISO-2022-JP-1 but for euc-jp we have to have a whole ucm for each. This is definitely a todo for Perl 5.8.1 and up and I have already come up with a solution; the future Encode (Encode II) will support CES-generator; that is, you can express euc-jp not as a whole big table but a combination of tables. That will also reduce the duplicates found in vendor mappings. It will be a complete rewrite of encengine.c But that requires not only codes but the expansion of UCM format so give me more time (and Perl 5.8.0!) Dan the Encode Maintainer
[Encode] Encode-JIS2K-0.01 uploaded to CPAN
Folks,

I gotta go in 5 minutes so I just dump the README file after the sig.

Dan the Encode Maintainer

NAME
    Encode::JIS2K - JIS X 0213 (aka JIS 2000) Encodings

INSTALLATION
    To install this module type the following:

        perl Makefile.PL
        make
        make test
        make install

SYNOPSIS
    use Encode::JIS2K;
    use Encode qw/encode decode/;
    $euc_2k = encode("euc-jisx0213", $utf8);
    $utf8   = decode("euc-jisx0213", $euc_jp);

ABSTRACT
    This module implements encodings that cover the JIS X 0213 charset (AKA JIS 2000, hence the module name). Encodings supported are as follows.

      Canonical       Alias                               Description
      ----------------------------------------------------------------
      euc-jisx0213    qr/\beuc.*jp[ \-]?(?:2000|2k)$/i    EUC-JISX0213
                      qr/\bjp.*euc[ \-]?(2000|2k)$/i
                      qr/\bujis[ \-]?(?:2000|2k)$/i
      shiftjisx0213   qr/\bshift.*jis(?:2000|2k)$/i       Shift_JISX0213
                      qr/\bsjis[ \-]?(?:2000|2k)$/i
      iso-2022-jp-3
      jis0213-1-raw                                       JIS X 0213 plane 1, raw format
      jis0213-2-raw                                       JIS X 0213 plane 2, raw format

DESCRIPTION
    To find out how to use this module in detail, see the Encode manpage.

  What is JIS X 0213 anyway?
    Simply put, JIS X 0213 is a rework and reorganization of JIS X 0208 and JIS X 0212. It consists of two 94x94 planes which roughly correspond as follows;

      JIS X 0213 Plane 1 = JIS X 0208 + extension
      JIS X 0213 Plane 2 = JIS X 0212 reorganized + extension

    And here is the character repertoire thereof at a glance.

                          # of codepoints   Kuten Ku (rows) used
      JIS X 0208          6,879             1..8,16..83
      JIS X 0213-1        8,762             1..94 (all!)
      JIS X 0212          6,067             2,6..7,9..11,16..77
      JIS X 0213-2        2,436             1,3..5,8,12..15,78..94
      ----------------------------------------------------------
      (JIS X0213 Total)   11,197

    JIS X 0213 was designed to extend JIS X 0208 and JIS X 0212 without being incompatible with (classic) EUC-JP and Shift_JIS. The following characteristics are a result thereof.

    o JIS X 0213 plane 1 is (almost) a superset of JIS X 0208. However, with Unicode 3.2.0 the mappings differ in 3 codepoints.

        Kuten    JIS X 0208 -> Unicode            JIS X 0213 -> Unicode
        ----------------------------------------------------------------
        1-1-17   <UFFE3> # FULLWIDTH MACRON       <U203E> # OVERLINE
        1-1-29   <U2014> # EM DASH                <U2015> # HORIZONTAL BAR
        1-1-79   <UFFE5> # FULLWIDTH YEN SIGN     <U00A5> # YEN SIGN

    o By the same token, JIS X 0213 plane 2 contains JIS Dai-4 Suijun Kanji (JIS Kanji Repertoire Level 4). This allows EUC-JP's G3 to contain both JIS X 0212 and JIS X 0213 plane 2. However, JIS X 0212:1990 already contains many of the Dai-4 Suijun Kanji, so EUC's G3 is subject to containing duplicate mappings.

    o Because of Halfwidth Katakana, Shift_JIS mapping has been tricky and it is now even trickier. Here is a regex that matches a Shift_JISX0213 sequence (note: you have to "use bytes" to make it work!)

        $re_valid_shifjisx0213 =
            qr/^(?:
                 [\x00-\x7f] |                            # ASCII or
                 [\xa1-\xdf] |                            # JIS X 0201 KANA or
                 [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc] # JIS X 0213
             )+$/xo;

  Note on EUC-JISX0213 (vs. EUC-JP)
    As of Encode-1.64, 'euc-jp' does support euc-jisx0213 for decoding. However, 'euc-jp' in Encode and 'euc-jisx0213' differ as follows;

                          euc-jp                      euc-jisx0213
      ----------------------------------------------------------------
      Decodes             (0201-K|0208|0212|0213)     ditto
      Round-Trip (|0)     JIS X (0201-K|0208|0212)    JIS X (0201-K|0213)
      Decode Only (|3)    those only found in 0213    those only found in 0212
      ----------------------------------------------------------------

AUTHORS
    Dan Kogai [EMAIL PROTECTED]

COPYRIGHT
    Copyright 2002 by Dan Kogai [EMAIL PROTECTED]. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO
    the Encode manpage, the Encode::JP
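
To make the Shift_JISX0213 pattern in the README easier to try, here is a small sketch that applies it to raw octets. It assumes the input has not been decoded; the helper name looks_like_shiftjisx0213 is hypothetical, and the anchors are tightened to \A and \z, which is a deviation from the README's ^ and $.

```perl
use strict;
use warnings;

# The README's pattern, applied to raw octets (not decoded characters).
my $re_valid_shiftjisx0213 = qr/\A(?:
    [\x00-\x7f]                              |  # ASCII
    [\xa1-\xdf]                              |  # JIS X 0201 kana
    [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]    # two-byte JIS X 0213
)+\z/x;

# Return 1 when every octet is consumed by one of the three alternatives.
sub looks_like_shiftjisx0213 {
    my ($octets) = @_;
    return $octets =~ $re_valid_shiftjisx0213 ? 1 : 0;
}

print looks_like_shiftjisx0213("\x82\xa0") ? "ok\n" : "not ok\n";   # complete two-byte sequence
print looks_like_shiftjisx0213("abc\x82")  ? "ok\n" : "not ok\n";   # lead byte with no trail byte
```

Note that a match only says the octets are structurally plausible; it does not prove that every two-byte pair is actually assigned in JIS X 0213.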
Re: Encode doesn't like undef
On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote: This is with Encode 1.64 $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)' Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183. I don't know Encode well enough to check if there are any other places this will strike. I think we'd better leave that way; It needs a PV to (en|de)code so consider this a feature. Of course perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()' is perfectly safe and legal. Dan the Encode Maintainer
Re: Encode doesn't like undef
On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote: I agree that passing undef() to one of the encoding functions may be an edge condition too far, but passing a variable that contains undef is more common. $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)' Name main::a used only once: possible typo at -e line 1. Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183. Can this be detected & silenced?

You've got a point. A warning should warn when and only when there is a danger therein, and passing undef itself is harmless. And this can be done easily by adding defined $str or return; to each sub concerned. Okay, I'll go for that.

Dan the Encode Maintainer
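
The guard mentioned above can also live on the caller's side, which works whatever Encode itself decides to do about undef. A minimal sketch, assuming a hypothetical wrapper named encode_or_undef (not part of Encode):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical caller-side wrapper: return early on undef instead of
# handing it to encode(), so no "uninitialized value" warning is raised.
sub encode_or_undef {
    my ($encoding, $str) = @_;
    defined $str or return;
    return encode($encoding, $str);
}

my $octets = encode_or_undef('UTF-8', "\x{20ac}");
printf "%d octets\n", length $octets;
```

Keeping the check in the caller leaves the library's warning in place for callers who would rather hear about an unexpected undef.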
Encode should stay undefphobia
On Wednesday, May 1, 2002, at 02:10 , Nick Ing-Simmons wrote: Dan Kogai [EMAIL PROTECTED] writes: Please don't. $a =~ tr/A/a/; gives a warning so should encode/decode. How can I be so dumb for not anticipating you say that! (Blame it on the fever). Paul, I now think Nick's got more points than yours so I will revert it in the next version. Maybe I will document this undef-phobia of Encode subs in the POD Dan the Warned Man
[Encode] 1.66 Released
My fever is down at last when I released Encode-1.66, available as follows;

Whole: http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz or CPAN
Diff against current: 264 lines
  http://www.dan.co.jp/~dankogai/current-1.66.diff.gz

And Changes.

$Revision: 1.66 $ $Date: 2002/05/01 05:41:06 $
! Encode.xs t/fallback.t
  WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a
  warning while fallback is in effect. This even came with a welcome
  side-effect of cleaner code with less nesting! Thank you, NI-XS.
  t/fallback.t is also modified to test this. And of course, the
  corresponding variables to UV[Xx]f are appropriately cast. This
  should've concluded the NI-XS homework.
! Encode.pm
  encode(undef) does warn again! Repented upon suggestion by NI-XS.
  Document for unless vs. '' added
  Message-Id: [EMAIL PROTECTED]

As you see, this is a NI-XS homework issue. Now I have only djgpp left (I think; djgpp is just so slow in my env.)

Dan the Encode Maintainer
http://bleedperl.dan.co.jp:8080/
I have set up an experimental mod_bleedperl server whose URI is shown in the subject. To demonstrate the power of Perl 5.8, I have written a small cgi/pl (.pl runs on Apache::Registry) called piconv.pl, a web version of piconv(1).

http://bleedperl.dan.co.jp:8080/piconv/

(Don't forget :8080; it's not run as root!)

What's so funny is that this service can be used to 'asciify' non-ascii web pages. Bart's idea of HTMLCREF is fully exploited here. To find out, try

http://bleedperl.dan.co.jp:8080/piconv/piconv.pl?f=euc-jp&t=ascii&u=www.yahoo.co.jp

Then

http://bleedperl.dan.co.jp:8080/piconv/piconv.pl?f=euc-jp&t=ascii&o=plain&u=www.yahoo.co.jp

Dan the Network Consultant by Trade
Unicode::Unihan 0.01 uploaded to CPAN
I have made a perl module called Unicode::Unihan, a module which makes accessing the Unihan DB very easy. Readme after my sig.

As for the copyright and such, I've read through the original Unicode-Unihan-3.2.0 and I concluded I have no problem publicizing this, but if it does infringe any of it, tell me and I'll remove it from CPAN.

Dan the Open Source Developer
--
Dan Kogai
CEO, DAN co. ltd.
2-8-14-418 Shiomi Koto-ku Tokyo 135-0052 Japan
mailto: [EMAIL PROTECTED] / http://www.dan.co.jp/
Tel:+81 3-5665-6131 Fax:+81 3-5665-6132
GPG Key: http://www.dan.co.jp/~dankogai/dankogai.gpg.asc

Unicode::Unihan
===============

INSTALLATION
    To install this module type the following:

        perl Makefile.PL
        make
        make test
        make install

DEPENDENCIES
    This module requires perl 5.6 or better.

NAME
    Unicode::Unihan - The Unihan Data Base 3.2

SYNOPSIS
    use Unicode::Unihan;
    my $db = new Unicode::Unihan;
    print join("," => $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}")), "\n";

ABSTRACT
    This module provides a user-friendly interface to the Unicode Unihan Database 3.2. With this module, using the Unihan database is as easy as shown in the SYNOPSIS above.

DESCRIPTION
    The first thing you do is make the database available. Just say

        use Unicode::Unihan;
        my $db = new Unicode::Unihan;

    That's all you have to say. After that, you can access the database via $db->tag($string) where tag is the tag in the Unihan Database, without the 'k' prefix.

    $data = $db->tag($string)
    @data = $db->tag($string)
        The first form (scalar context) returns the Unihan Database entry of the first character in $string. The second form (array context) checks the entry for each character in $string.

            @data = $db->Mandarin("\x{5c0f}\x{98fc}\x{5f3e}");
            # @data is now ('SHAO4 XIAO3','SI4','DAN4')
            @data = $db->JapaneseKun("\x{5c0f}\x{98fc}\x{5f3e}");
            # @data is now ('CHIISAI KO O','KAU YASHINAU','TAMA HAZUMU HIKU')

SEE ALSO
    the perluniintro manpage
    the perlunicode manpage
    The Unihan Database, in Text
    http://www.unicode.org/Public/3.2-Update/Unihan-3.2.0.txt.gz

AUTHOR
    For the Module: Dan Kogai [EMAIL PROTECTED]
    For the Source Data: Unicode, Inc.

COPYRIGHT AND LICENSE
    For the Module: Copyright 2002 by Dan Kogai, All rights reserved. This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
    For the Source Data: Copyright (c) 1996-2002 Unicode, Inc. All Rights reserved.

        Name: Unihan database
        Unicode version: 3.2.0
        Table version: 1.1
        Date: 15 March 2002
[Encode] 1.61 released
I know we are one more step closer to 5.8 when I released Encode 1.61, available as follows;

Whole: http://www.dan.co.jp/~dankogai/Encode-1.61.tar.gz and CPAN
Diff against current (840 lines)
  http://www.dan.co.jp/~dankogai/current-1.61.diff.gz

And changes.

$Revision: 1.61 $ $Date: 2002/04/26 03:02:04 $
! t/mime-header.t
  Now does decent tests besides use_ok()
! lib/Encode/Guess.pm t/guess.t
  UI streamlined, document added
! Unicode/Unicode.xs
  various signed/unsigned mismatch nits (#16173)
  http://public.activestate.com/cgi-bin/perlbrowse?patch=16173
! Encode.pm
  POD: utf8-flag-related caveats added. A few sections completely
  rewritten.
! Encode.xs
! AUTHORS
  Thou shalt not assume %d works, either! Robin Baker added to AUTHORS
  for this
  Message-Id: [EMAIL PROTECTED]
! t/CJKT.t
  Change 16144 by gsar@onru on 2002/04/24 18:59:05

Dan the Encode Maintainer
Re: Practical problems with custom .ucm based encoding
On Wednesday, April 24, 2002, at 09:25 , Bart Schuller wrote: Hello, The cool Encoding support in 5.8-to-be enables me to properly solve a very common task: making HTML entities out of utf-8 data. I generated a ucm file with entries like this: <U00A0> \x26\x6E\x62\x73\x70\x3B |0 # nbsp The resulting Encode::HTMLEntities encoding works perfectly. However, I want it to do more. Not every unicode character has a corresponding entity. Unknown ones can be encoded like &#8364;, so I would like my Encoding to use a simple function as a fallback. This proves hard. With CHECK == Encode::FB_WARN it looks like the whole string is left untouched, so my plan to just substr() off the first character, handle it by hand and repeat is not going to work. I'd be very happy with a CHECK mode which would allow me to handle a single problematic character in perl. Having to find it in a longer string is very hard in this case, because it's every character 0x{7f} which is not in my .ucm file.

As a matter of fact, I was thinking of adding FB_HTMLENT or something like that. It seems trivial; unless jhi whips me for the sin of Feeping Creaturism, I'll do so. CAVEAT; This will be done via fallback so will not turn into entities!

Dan the Encode Maintainer
Re: Practical problems with custom .ucm based encoding
On Wednesday, April 24, 2002, at 09:43 , Bart Schuller wrote: Character Reference is the proper term; for entities you'd need my whole module. Please go completely overboard and have FB_XMLCHARREF in addition to FB_HTMLCHARREF, the difference being that the XML version would make it &#x20ac;

Shoot! I've just implemented FB_HTMLENT! (quick, wasn't it?) Okay, be it CHARREF (or isn't there a good short abbreviation for that?). Let me check to make sure... wait,

HTML Char Ref; &#1234;
XML Char Ref; &#xabcd;

(just checked http://www.w3.org/TR/html4/charset.html).

Dan the Encode Maintainer
Re: Practical problems with custom .ucm based encoding
On Wednesday, April 24, 2002, at 10:07 , Bart Schuller wrote: On Wed, Apr 24, 2002 at 09:56:29PM +0900, Dan Kogai wrote: Shoot! I've just implemented FB_HTMLENT ! (quick, wasn't it?) Okay, be it CHARREF (or isn't there a good short abbreviation for that?). Let me CHARREF is as short as I can make it. You don't happen to have an Encode interim release that I can test, do you? Sorry. I have picked CREF. Well, I have to prepend ENCODE_ for those macros so even 7 letters is too long ;). Well, at least they are documented. They are all implemented, with a test suite in t/fallback.t, and documented now. Stay tuned! Perhaps my encoding is also small enough to be included with Encode, if you think it makes sense. Please send me your UCM and I will review it. Well... maybe not. enc2xs is trivially easy and I should give CPAN modules a room to explore :) Isn't open source fun, instant response. Many thanks for your work on Encode, you are putting a lot of time in it. It definitely is. You're welcome. And as for the time I have spent, heck, that's what The First Great Virtue of a Programmer dictates me. I hope, no, DEMAND, that users save billions of hours w/ Perl 5.8.0 Dan the Lazy Man
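
The constants this thread settled on are available in released Encode as Encode::FB_HTMLCREF and Encode::FB_XMLCREF. The sketch below shows them on a string containing one character that ASCII lacks; the wrapper to_charrefs and the sample text are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Replace characters the target encoding lacks with character references.
# $check is one of the Encode fallback constants.
sub to_charrefs {
    my ($text, $encoding, $check) = @_;
    my $copy = $text;    # encode() may modify its input when CHECK is set
    return encode($encoding, $copy, $check);
}

my $text = "price: \x{20ac}5";    # EURO SIGN is not in ASCII
print to_charrefs($text, 'ascii', Encode::FB_HTMLCREF()), "\n";    # decimal reference
print to_charrefs($text, 'ascii', Encode::FB_XMLCREF()),  "\n";    # hexadecimal reference
```

As the caveat earlier in the thread notes, this is a fallback: characters that the target encoding can represent are passed through unchanged rather than turned into references.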
FYI: Encode performance on Japanese encodings
I was curious to find out how fast or slow Encode is against popular Japanese transcoder modules. So I benchmarked it and was relieved to find that Encode's performance was good!

I benchmarked it against Jcode.pm (mine, too) and jcode.pl (the first and still popular transcoder, available since Perl4, by Utashiro-san) and here is the result. Except for 7bit-jis (ISO-2022) <-> euc-jp, Encode performed the best. And even for those losing against Jcode.pm, the performance loss is not that big. Note that Unicode conversion tests are missing for jcode.pl because it does not implement them.

The Japanese have one more good reason to switch to Perl 5.8.

Dan the Encode Maintainer.

7bit-jis -> euc-jp
            Rate   Encode jcode.pl Jcode.pm
Encode   69.6/s       --     -16%     -40%
jcode.pl 83.1/s      19%       --     -29%
Jcode.pm  116/s      67%      40%       --

7bit-jis -> shiftjis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm 14.6/s       --      -9%     -80%
jcode.pl 16.0/s       9%       --     -78%
Encode   71.9/s     393%     351%       --

7bit-jis -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm 48.3/s       --     -26%
Encode   65.3/s      35%       --

7bit-jis -> utf8
            Rate Jcode.pm   Encode
Jcode.pm 31.9/s       --     -63%
Encode   86.5/s     171%       --

euc-jp -> 7bit-jis
            Rate jcode.pl   Encode Jcode.pm
jcode.pl 45.9/s       --     -32%     -60%
Encode   67.7/s      48%       --     -41%
Jcode.pm  114/s     149%      69%       --

euc-jp -> shiftjis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm 16.8/s       --     -16%     -92%
jcode.pl 20.0/s      19%       --     -90%
Encode    206/s    1129%     931%       --

euc-jp -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm 85.3/s       --     -47%
Encode    160/s      87%       --

euc-jp -> utf8
            Rate Jcode.pm   Encode
Jcode.pm 44.9/s       --     -89%
Encode    400/s     791%       --

shiftjis -> 7bit-jis
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm 13.3/s       --      -7%     -81%
jcode.pl 14.2/s       7%       --     -80%
Encode   70.7/s     434%     397%       --

shiftjis -> euc-jp
            Rate Jcode.pm jcode.pl   Encode
Jcode.pm 15.1/s       --     -25%     -93%
jcode.pl 20.1/s      33%       --     -90%
Encode    210/s    1285%     943%       --

shiftjis -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm 12.8/s       --     -93%
Encode    175/s    1270%       --

shiftjis -> utf8
            Rate Jcode.pm   Encode
Jcode.pm 11.2/s       --     -98%
Encode    512/s    4456%       --

ucs2 -> 7bit-jis
            Rate Jcode.pm   Encode
Jcode.pm 61.8/s       --      -1%
Encode   62.4/s       1%       --

ucs2 -> euc-jp
            Rate Jcode.pm   Encode
Jcode.pm  138/s       --      -9%
Encode    151/s       9%       --

ucs2 -> shiftjis
            Rate Jcode.pm   Encode
Jcode.pm 14.9/s       --     -91%
Encode    162/s     989%       --

ucs2 -> utf8
            Rate Jcode.pm   Encode
Jcode.pm 33.3/s       --     -87%
Encode    267/s     700%       --

utf8 -> 7bit-jis
            Rate Jcode.pm   Encode
Jcode.pm 59.5/s       --     -18%
Encode   72.7/s      22%       --

utf8 -> euc-jp
            Rate Jcode.pm   Encode
Jcode.pm  129/s       --     -44%
Encode    233/s      80%       --

utf8 -> shiftjis
            Rate Jcode.pm   Encode
Jcode.pm 14.7/s       --     -94%
Encode    261/s    1673%       --

utf8 -> ucs2
            Rate Jcode.pm   Encode
Jcode.pm 50.6/s       --     -74%
Encode    191/s     278%       --
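
The tables above have the shape that Benchmark's cmpthese() prints. The harness actually used is not shown in this message, so the following is only a sketch of how such a comparison could be set up with Encode alone; the sample text, the iteration count, and the helper convert are assumptions, and Jcode.pm and jcode.pl are left out because they are not bundled with perl.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Encode qw(from_to encode);

# Made-up sample: a short Japanese phrase repeated to give the
# converters some work to do.
my $utf8_octets = encode('UTF-8', "\x{5c0f}\x{98fc}\x{5f3e}\x{3000}" x 500);

# Convert a copy of the octets from one encoding to another and
# return the converted octets.
sub convert {
    my ($octets, $from, $to) = @_;
    from_to($octets, $from, $to);    # converts in place
    return $octets;
}

# Compare two conversion directions; cmpthese prints a rate table.
cmpthese(200, {
    'utf8->euc-jp'   => sub { convert($utf8_octets, 'utf8', 'euc-jp') },
    'utf8->shiftjis' => sub { convert($utf8_octets, 'utf8', 'shiftjis') },
});
```

Absolute rates depend heavily on the machine and on the input, so numbers from a run like this are not comparable with the 2002 figures above.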
[Encode] 1.51 Released
I was anticipating the release of 1.51 AFTER I get to bed and back. But my insomnia and earlier-than-expected responses from NI-XS and Autrijus have accelerated the release by at least 6 hours :)

Get it via http://www.dan.co.jp/~dankogai/Encode-1.51.tar.gz or CPAN.

Though the changes are small codewise, this release includes updates to two giant ucm files, so a diff will not be supplied. Please get the whole thing.

1.51 $Date: 2002/04/20 09:58:23 $
! t/TW.t
  Updated test suite by Autrijus so make test is happy again
  Message-Id: [EMAIL PROTECTED]
+ ucm/big5-eten.ucm
! ucm/big5-hkscs.ucm lib/Encode/Alias.pm
- ucm/big5.ucm TW/TW.pm TW/Makefile.PL
  Updates by Autrijus. 'big5' is no longer a canonical name but an
  alias to 'big5-eten'. big5-hkscs is now the 2001 edition.
  Message-Id: [EMAIL PROTECTED]
! Encode.xs
  Fix by NI-XS for a fallback that may cause SEGV w/ Perl/TK
  Message-Id: [EMAIL PROTECTED]
! Encode.pm
  PerlIO detection a little bit smarter; no longer uses eval qq{} but
  eval {}.

Dan the Encode Maintainer
Re: Encode-1.50 +
On Sunday, April 21, 2002, at 04:50 , Nick Ing-Simmons wrote: I just checked in these changes to ext/Encode/... as change 16022 on perlio branch.

To honor whitespace, I usually rsync perl-core first, then copy files back to my repository for NI-XS (this works only for patches from those w/ commit rights to the perl repository, however). But it seems like AS is not new enough so...

- switch to XSLoader
- spelling & trailing whitespace removal.
- remove a use loop (Encode loaded PerlIO::encoding, loaded Encode); it never loops, but such things cause problems for imports.
- Changed how LEAVE_SRC was tested; x & ~y is not same as !(x & y)
- Moved Unicode.xs towards supporting same check values.
- Set Encode::XS::ISA to Encode::Encoding
- added ->needs_lines method with my best guess at which ones do.

I did this;

* Copy the patch chunk
* perl -i.bak -pe 's/\s+\n/\n/o' patch.file to make sure there is no trailing space before LF
* patch -l so patch ignores the number of whitespaces ahead

And the resulting patch works pretty well. Among 18 hunks one failed at Encode.pm and that was trivial to mend manually. And make distclean -> bleedperl Makefile.PL -> make test works beautifully.

I still cannot get TODO tests in t/perlio.t despite some work on PerlIO::encoding to honour ->needs_lines. I need to study it some more. What I really want to do is have PerlIO::encoding use fallback schemes. Which ENCODE_FB_XXX flag bit(s) give me fallback characters but still remove translated stuff from the src buffer? Perhaps update src should be an active rather than a passive bit?

Please wait till caffeine runs in my bloodstream. I just woke up (because of insomnia or whatever, I was not quite nocturnal last night; it is 5 minutes before 06:00 AM JST).

Dan the Encode Maintainer
Re: [big5-*.ucm] please revise if possible
On Sunday, April 21, 2002, at 02:32 , Autrijus Tang wrote: Updated maps and test: http://egb.elixus.org/~autrijus/big5-1.52.tgz Ucmlint still complains, due to the order issue outlined in the previous mail.

As you have intelligently found, the order for duplicate maps DOES matter; |1 or |3 have to come AFTER |0. So I wrote a quick and dirty sort program that just does this, as follows;

    #!
    use strict;
    my @lines;
    while (<>) {
        chomp;
        m/^<U/o or next;
        push @lines, [ split ];
    }
    for (sort {
            $a->[0] cmp $b->[0]    # Unicode code point first
         or $a->[2] cmp $b->[2]    # then fallback flag, so |0 precedes |1 and |3
         or $a->[1] cmp $b->[1]    # then encoded bytes
        } @lines)
    {
        print join(" " => @$_), "\n";
    }
    __END__

And I put the sorted text back into the ucm files and now they all round-trip. This is easy to understand once you know how enc2xs works. It has two hashes, %e2u and %u2e. If |0, it updates both. If |1 or |3, it updates either. And on update, the old hash entry is overwritten, so |3 goes in vain if it is followed by |0. Maybe I should document this in the enc2xs pod.

XinKeLe!
Dan the Encode Maintainer perl5.7.3 -Mblib bin/ucmlint -e ucm/big5-eten.ucm ucm/big5-eten.ucm:warning in line 421: dupe encode map: U2550 = F9,F9 and A2,A4 ucm/big5-eten.ucm:warning in line 436: dupe encode map: U255E = F9,E9 and A2,A5 ucm/big5-eten.ucm:warning in line 440: dupe encode map: U2561 = F9,EB and A2,A7 ucm/big5-eten.ucm:warning in line 450: dupe encode map: U256A = F9,EA and A2,A6 ucm/big5-eten.ucm:warning in line 454: dupe encode map: U256D = A2,7E and F9,FA ucm/big5-eten.ucm:warning in line 456: dupe encode map: U256E = A2,A1 and F9,FB ucm/big5-eten.ucm:warning in line 458: dupe encode map: U256F = A2,A3 and F9,FD ucm/big5-eten.ucm:warning in line 460: dupe encode map: U2570 = A2,A2 and F9,FC ucm/big5-eten.ucm: no error found dankogai@dan-attic[6276]:~/work/Encode perl5.7.3 -Mblib bin/ucmlint -e ucm/big5-hkscs.ucm ucm/big5-hkscs.ucm:warning in line 1900: dupe encode map: U301E = A1,AA and C6,DE ucm/big5-hkscs.ucm:warning in line 2710: dupe encode map: U4EDD = C9,69 and C6,DF ucm/big5-hkscs.ucm:warning in line 2932: dupe encode map: U50ED = B9,B0 and 9F,CB ucm/big5-hkscs.ucm:warning in line 2981: dupe encode map: U5159 = A2,59 and 92,AF ucm/big5-hkscs.ucm:warning in line 2983: dupe encode map: U515B = A2,5A and 92,B0 ucm/big5-hkscs.ucm:warning in line 2986: dupe encode map: U515D = A2,5C and 92,B1 ucm/big5-hkscs.ucm:warning in line 2988: dupe encode map: U515E = A2,5B and 92,B2 ucm/big5-hkscs.ucm:warning in line 4137: dupe encode map: U5C10 = C9,5C and 9C,BC ucm/big5-hkscs.ucm:warning in line 4384: dupe encode map: U5F0C = 93,61 and 9F,D8 ucm/big5-hkscs.ucm:warning in line 4509: dupe encode map: U6062 = AB,EC and 9E,A9 ucm/big5-hkscs.ucm:warning in line 4765: dupe encode map: U62CE = A9,F0 and A0,77 ucm/big5-hkscs.ucm:warning in line 4767: dupe encode map: U62D0 = A9,E4 and 9D,C4 ucm/big5-hkscs.ucm:warning in line 5935: dupe encode map: U6FB6 = BF,47 and 9B,F6 ucm/big5-hkscs.ucm:warning in line 5974: dupe encode map: U701E = 96,EE and 96,ED 
ucm/big5-hkscs.ucm:warning in line 6119: dupe encode map: U71DF = C0,E7 and 9C,62 ucm/big5-hkscs.ucm:warning in line 6165: dupe encode map: U7250 = 94,55 and A0,E4 ucm/big5-hkscs.ucm:warning in line 6337: dupe encode map: U7468 = 94,7A and A0,D5 ucm/big5-hkscs.ucm:warning in line 6659: dupe encode map: U77D7 = C5,F7 and 9B,78 ucm/big5-hkscs.ucm:warning in line 6825: dupe encode map: U79E3 = AF,B0 and 9C,BD ucm/big5-hkscs.ucm:warning in line 6958: dupe encode map: U7B51 = B5,AE and 9D,5A ucm/big5-hkscs.ucm:warning in line 6990: dupe encode map: U7BB8 = BA,E6 and 8E,69 ucm/big5-hkscs.ucm:warning in line 7084: dupe encode map: U7CCE = A2,61 and 8E,7E ucm/big5-hkscs.ucm:warning in line 7195: dupe encode map: U7DD2 = BA,FC and 8E,AB ucm/big5-hkscs.ucm:warning in line 7227: dupe encode map: U7E1D = BF,A6 and 8E,B4 ucm/big5-hkscs.ucm:warning in line 7368: dupe encode map: U8005 = AA,CC and 8E,CD ucm/big5-hkscs.ucm:warning in line 7387: dupe encode map: U8028 = BF,AE and 8E,D0 ucm/big5-hkscs.ucm:warning in line 7736: dupe encode map: U83C1 = B5,D7 and 8F,57 ucm/big5-hkscs.ucm:warning in line 7839: dupe encode map: U8503 = 92,42 and 92,44 ucm/big5-hkscs.ucm:warning in line 8047: dupe encode map: U880F = 8F,B6 and A0,63 ucm/big5-hkscs.ucm:warning in line 8181: dupe encode map: U89A6 = BF,CC and 8F,CB ucm/big5-hkscs.ucm:warning in line 8184: dupe encode map: U89A9 = A0,D4 and 8F,CC ucm/big5-hkscs.ucm:warning in line 8494: dupe encode map: U8D77 = B0,5F and 8F,FE ucm/big5-hkscs.ucm:warning in line 8825: dupe encode map: U90FD = B3,A3 and 90,6D ucm/big5-hkscs.ucm:warning in line 9045: dupe encode map: U936E = A0,5F and 92,C8 ucm/big5-hkscs.ucm:warning in line 9335: dupe encode map: U975C = C0,52 and 90,DC ucm/big5-hkscs.ucm:warning in line 9337: dupe
Encode-1.50 and PerlIO::encoding 0.02 released
I am daydreaming that I am a caravan member, driving a herd of disobedient camels on the never-ending desert to an oasis called 5.8.0 when I released new Encode and PerlIO::encoding. You can get one as follows. Whole: Encode http://www.dan.co.jp/~dankogai/Encode-1.50.tar.gz and CPAN PerlIO::encoding http://www.dan.co.jp/~dankogai/PerlIO-encoding-0.02.tar.gz Diff Encode http://www.dan.co.jp/~dankogai/current-1.50.diff.gz PerlIO::encoding [ none ] Diff is pretty big ( 3000 lines) so you should get a whole thing instead. The biggest and the foremost change is the fallback API which is greatly enhanced. NI-XS request of On Friday, April 19, 2002, at 05:01 , Nick Ing-Simmons wrote: check == 11 - silent fail with $string updated (What Tk wants) is implemented as FB_QUIET. see below; Handling Malformed Data THE CHECK argument is used as follows. When you omit it, it is identical to CHECK = 0. CHECK = Encode::FB_DEFAULT ( == 0) If CHECK is 0, (en|de)code will put substitution char- acter in place of the malformed character. for UCM- based encodings, subchar will be used. For Unicode, \xFFFD is used. If the data is supposed to be UTF-8, an optional lexical warning (category utf8) is given. CHECK = Encode::DIE_ON_ERROR (== 1) If CHECK is 1, methods will die immediately with an error message. so when CHECK is set, you should trap the fatal error with eval{} unless you really want to let it die on error. CHECK = Encode::FB_QUIET If CHECK is set to Encode::FB_QUIET, (en|de)code will immediately return proccessed part on error, with data passed via argument overwritten with unproccessed part. This is handy when have to repeatedly call because the source data is chopped in the middle for some reasons, such as fixed-width buffer. Here is a sample code that just does this. 
    my $data = '';
    # read returns 0 at end of file (which is still defined),
    # so test its truth value rather than definedness
    while (read $fh, $buffer, 256) {
        # buffer may end in a partial character so we append
        $data .= $buffer;
        $utf8 .= decode($encoding, $data, Encode::FB_QUIET);
        # $data now contains the unprocessed partial character
    }

CHECK = Encode::FB_WARN
  This is the same as above, except that it warns on error. Handy when you are debugging the mode above.

perlqq mode (CHECK = Encode::FB_PERLQQ)
  For encodings that are implemented by Encode::XS, CHECK == Encode::FB_PERLQQ turns (en|de)code into perlqq fallback mode. When you decode, '\xXX' will be placed where XX is the hex representation of the octet that could not be decoded to utf8. And when you encode, '\x{XXXX}' will be placed where XXXX is the Unicode ID of the character that cannot be found in the character repertoire of the encoding.

The bitmask
  These modes are actually set via a bitmask. Here is how the FB_XX constants are laid out. For FB_XX you can import via use Encode qw(:fallbacks); for the generic bitmask constants, you can import via use Encode qw(:fallback_all).

                       FB_DEFAULT FB_CROAK FB_QUIET FB_WARN FB_PERLQQ
  DIE_ON_ERR    0x0001             X
  WARN_ON_ERR   0x0002                               X
  RETURN_ON_ERR 0x0004                      X        X
  LEAVE_SRC     0x0008
  PERLQQ        0x0100                                       X

Unimplemented fallback schemes
  In the future you will be able to use a code reference to a callback function for the value of CHECK, but its API is still undecided.

Since PerlIO::encoding was incapable of using this new feature, I have updated PerlIO::encoding as well. Instead of pushing PL_sv_yes to the stack, struct PerlIOEncode now has one more member, chk, that is initialized with Encode::FB_QUIET.

  typedef struct {
      PerlIOBuf base;   /* PerlIOBuf stuff */
      SV *bufsv;        /* buffer seen by layers above */
      SV *dataSV;       /* data we have read from layer below */
      SV *enc;          /* the encoding object */
      SV *chk;          /* CHECK in Encode methods */
  } PerlIOEncode;

Encode now checks the version of PerlIO::encoding and refuses to use an obsolete version. See t/perlio.t for details. That way PerlIO::encoding has no trouble should Encode change the value of FB_QUIET.
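The fallback modes described in this message can be tried out with a few lines of Perl. The following is only a minimal sketch against a current Encode installation, not part of the 1.50 distribution: the sample string and the variable names are made up for illustration, and the choice of iso-8859-1 as the target encoding is an assumption (any encoding that lacks U+263A would do).

```perl
use strict;
use warnings;
use Encode qw(encode :fallback_all);

# "caf\x{e9}" fits in iso-8859-1; U+263A (WHITE SMILING FACE) does not.
my $text = "caf\x{e9} \x{263A}";

# FB_CROAK: encode dies on the unmappable character, so trap it with eval.
my $croaked = !eval {
    my $src = $text;
    encode('iso-8859-1', $src, FB_CROAK);
    1;
};

# FB_QUIET: returns what was encoded so far and leaves the rest in $rest.
my $rest  = $text;
my $quiet = encode('iso-8859-1', $rest, FB_QUIET);

# FB_PERLQQ: the unmappable character is written out as a \x{...} escape.
my $src = $text;
my $qq  = encode('iso-8859-1', $src, FB_PERLQQ);

printf "FB_CROAK died:     %s\n", $croaked ? 'yes' : 'no';
printf "FB_QUIET encoded:  %d octets, %d character(s) left over\n",
    length $quiet, length $rest;
printf "FB_PERLQQ result:  %s\n", $qq;

# The FB_XX modes are combinations of the generic bitmask flags.
printf "FB_WARN == WARN_ON_ERR|RETURN_ON_ERR: %s\n",
    FB_WARN == (WARN_ON_ERR | RETURN_ON_ERR) ? 'yes' : 'no';
```

Note that each call gets its own copy of the string, because any CHECK value without LEAVE_SRC may overwrite the argument.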
As for the partial character problem, I have found it is nearly impossible for escape-based encodings to
Re: [PATCH] Big5-related changes.
On Saturday, April 20, 2002, at 04:53 , Autrijus Tang wrote:

I've been immersed in Big5-related issues in the past few days, and came back with these last-minute (err, week?) changes before 5.8-RC1. The Diff contains fixes to TW.pm, Alias.pm, and README.(tw|cn).

Excellent!

(For dan) big5-hkscs should be upgraded to the 2001 edition, as per Hong Kong government's decree. It's available separately at: http://egb.elixus.org/~autrijus/big5-hkscs.ucm.gz Also, please delete big5.ucm and replace it with big5-eten, at: http://egb.elixus.org/~autrijus/big5-eten.ucm.gz

Thus updated. I needed to update TW/Makefile.PL and lib/Encode/Config.pm (so it loads on 'big5-eten' instead of just 'big5'), but that's not at all a big deal.

I've fixed Alias.pm so big5 aliases to big5-eten. The reason is that the 'Big5' as originally defined isn't used anywhere on earth; non-Microsoft systems uses 'big5' to mean 'big5-eten', and Microsoft uses 'big5' to mean 'cp950'. It is therefore unwise to have a canonical 'big5' encoding, much like there should not be a 'gb2312' encoding. Since gb2312 is now aliased to euc-cn and not cp936, I think big5 should alias to big5-eten and not cp950.

I agree. AFAIK, Big5 is the only major CJK encoding not endorsed by the government. What's funny is that there seems to be less confusion between encodings in Taiwan than in Japan or Korea. Japan is the worst, using Shift_JIS, EUC-JP, ISO-2022-JP(-[12])? and now Unicode. (IMHO, however, the Japanese people should be proud of making multibyte character encoding a reality. But I can't help wondering whether this mess is way too high a price to pay :)

Oh, I just noticed that Dan retained the 'gb2312.ucm' name, although the encoding is called 'gb2312-raw'. I admit that I don't fully understand the reason, but if that's to stand, then big5-eten could also be named 'big5.ucm', and still say 'code_set_name big5-eten', for consistency's sake.

I renamed big5.ucm to big5-eten.ucm.
The reason -raw is missing from the *.ucm filenames is just that they would look too funny on 8.3 filesystems, nothing more :)

Thanks, /Autrijus/

Xin Ku Le ! \x{8f9b}\x{82e6}\x{4e86}

XiaoSi Dan \x{5c0f}\x{98fc} \x{5f3e}\n
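The aliasing discussed in this exchange (big5 to big5-eten, gb2312 to euc-cn) can be checked from Perl. This is a small sketch assuming an Encode installation that still ships the aliases as described here; the %canonical hash is just an illustrative name.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each alias to the canonical name of the encoding object it resolves to.
my %canonical;
for my $alias (qw(big5 gb2312)) {
    my $enc = find_encoding($alias)
        or die "unknown encoding: $alias";
    $canonical{$alias} = $enc->name;
}

printf "%-8s => %s\n", $_, $canonical{$_} for sort keys %canonical;
```

find_encoding loads Encode::TW or Encode::CN on demand, so no explicit use of those modules is needed.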
Re: Tk804 + Encode-1.50 :-) again
On Saturday, April 20, 2002, at 03:45 , Nick Ing-Simmons wrote:

Dan Kogai [EMAIL PROTECTED] writes: I am daydreaming that I am a caravan member, driving a herd of disobedient camels on the never-ending desert to an oasis called 5.8.0 when I released new Encode and PerlIO::encoding. You can get one as follows.

p4 integrated to //depot/perlio for testing. Without any changes to Tk804 things improved a bit - only the JP.t and KR.t tests were failing, and those not failing as badly.

I thought I had relocated the perlio-related tests in them to t/perlio.t. Are there any left?

Adding ENCODE_FB_QUIET to Tk's encode glue makes those pass as well.

That was my biggest concern. So glad to hear that.

Suggest one small tweak as in attached patch. The patch turns off utf8_to_uvuni's warning and checks as only thing we are using the UV for is an error message (which in my case isn't going to be printed as I am in FB_QUIET). Otherwise I get noise when Tk is groping about in U+FFXX page.

Applied, thanks.

The indent looks better - but has cuddled else - no big deal. I was a little surprised that Encode/encode.h gets installed in lib rather than archlib/CORE but can live with that (makes a kind of sense it is architecture neutral - but perl.h et. al. go elsewhere). The snag here is that Makefile.PL has added -I to find perl.h, so I have to #include ../../Encode/encode.h which is portability issue as there is no certainty that lib / archlib relative paths work like that. Will tweak Tk's Makefile.PL configure to hunt down encode.h.

I wonder if there is a more sensible way to install NON-PM files to PERL5LIB. For the time being it is at the mercy of MM. Though not a show stopper, I would like Encode to be as clean and standard-compliant as possible. MM is so vast I don't even know how many more features are hidden...

Will do a spelling patch on the pod(s) when I get a chance.

Yes, please.
Emacs doesn't do spellcheck-as-you-type like recent mailers in MacOS and Windows :) (I know you can spellcheck in Emacs but I am not sure if it is a good idea to do so in .pm).

Dan the Encode Maintainer
[Encode] Dark Side of the Emacs Modes [Was: Re: Tk804 ...]
On Saturday, April 20, 2002, at 05:38 , Nicholas Clark wrote:

On Sat, Apr 20, 2002 at 04:27:15AM +0900, Dan Kogai wrote:

Yes, please. Emacs doesn't do spellcheck-as-you-type like recent mailers in MacOS and Windows :) (I know you can spellcheck in Emacs but I am not sure if it is a good idea to do so in .pm).

You underestimate the power of the dark side. M-x flyspell-mode

I knew something like this existed but never checked the mode name :) Hmm Requires ispell... Piece of cake with portupgrade (could be the most widely used ruby program in (Free)BSD world)

Oh man! you're right! It even supports mouse (but I usually use emacs only via tty). But how about perl jargons?

automagicalNi! barewordsNi!

Hmm. This mode needs some more education :) Thanks. More than 10 years w/ Emacs and still lost in modes

Definitely part of the dark side because here it defaults to American.

Does it correct pronunciation of the Britons so CAN'T do that sounds less obscene :?

And then refuses to start because I don't have American dictionaries installed. ispell has no problem just running and finding the correct dictionaries.

Dan the Emacs User, not Elisp Hacker
^pretty funny. MacOS X Mail underlines this but not Emacs.

Is it smart enough to scan $PATH and make them correct?
Re: Please update Encode::HanExtra
On Thursday, April 18, 2002, at 04:40 , Autrijus Tang wrote: On Thu, Apr 18, 2002 at 11:41:48AM +0900, Dan Kogai wrote: http://www.dan.co.jp/~dankogai/Encode-HanExtra-0.04.tar.gz Please pick it up, add necessary changes and upload YOUR version to CPAN. Okay, will do. XieXieGeZuo But. Do we want optional compatibility with 5.7.[23]? i.e. only use enc2xs where it's available. I don't. So far we have no 5.8.0, meaning we don't have to think about backward compatibility at all -- yet. Dan the Encode Maintainer
Encode-1.42 PerlIO-encoding-0.01 now available
NI-XS, jhi and porters,

The surgical operation is finished. The PerlIO layer functions in Encode.xs have been successfully detached. Now the PerlIO part is in PerlIO::encoding. They are now more interdependent than dependent. You can get them via the URLs below:

http://www.dan.co.jp/~dankogai/PerlIO-encoding-0.01.tar.gz
http://www.dan.co.jp/~dankogai/Encode-1.42.tar.gz
http://www.dan.co.jp/~dankogai/perl-dan.tar.bz2

The last one is the whole perl with interdependent versions of Encode and PerlIO. As a matter of fact, just replace Encode with 1.42 above, untargzip PerlIO-encoding-0.01 at ext/PerlIO/ and rename the thawed directory to encoding, and fix the toplevel MANIFEST, and it will work perfectly. The Configure file needed no modification.

Here is how Encode tests as a module.

t/Aliases....ok
t/CN.........ok
t/Encode.....ok
t/Encoder....ok
t/JP.........ok, 6/27 skipped: PerlIO Encoding Needed
t/KR.........ok, 6/22 skipped: PerlIO Encoding Needed
t/TW.........ok
t/Unicode....ok
t/encoding...ok
t/grow.......ok
t/jperl......ok
All tests successful, 12 subtests skipped.
Files=11, Tests=4616, 11 wallclock secs ( 7.52 cusr + 0.50 csys = 8.02 CPU)

And with the whole perl and PerlIO:

ext/Encode/t/CN.........ok
ext/Encode/t/Encode.....ok
ext/Encode/t/Encoder....ok
ext/Encode/t/JP.........ok
ext/Encode/t/KR.........ok
ext/Encode/t/TW.........ok
ext/Encode/t/Unicode....ok
ext/Encode/t/encoding...ok
ext/Encode/t/grow.......ok
ext/Encode/t/jperl......ok
[]
ext/PerlIO/PerlIO.......ok
ext/PerlIO/t/encoding...ok
ext/PerlIO/t/scalar.....ok
ext/PerlIO/t/via........ok

Note that ext/PerlIO/t/encoding.t was never modified, so it is 100% compatible with the prior version.

FYI those will not be uploaded to CPAN; I'll wait until perl-current catches up. And PerlIO::encoding is not mine but NI-XS'. So if it is to be CPANized, it must be done by NI-XS (I pretty much doubt he will, however).

Man, I'm exhausted. Autrijus, Jungshik, sorry for not responding sooner. Please let me take a nap before I process your new READMEs.

Dan the Encode Maintainer.
Re: iso-2022-jp problem
On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:

I tracked down the problem tkmail was/is having with iso-2022-jp. The snag is I am using the API the way I designed it, not the way it is reliably implemented. When called thus:

  my $decoded = $enc->decode($encoded,1);

decode is supposed to return portion it can decode, and set $encoded to what remains.

Ah, I see. But it is a pain in the arse for doubly-encoded encodings like ISO-2022-JP. Here is the problem.

As you see, to decode ISO-2022-JP, we first have to decode it into EUC-JP. ISO-2022-JP -> EUC-JP is treated (and should be treated) purely as a CES, so there is no chance of error (unless there is a bogus escape sequence). However, errors may arise when you try to convert the resulting EUC-JP stream to UTF-8. The problem is that not all of the possible code points in JIS X 0208 and JIS X 0212 are actually used (94x94 = 8836), of which only 6884 are used in 0208 and 6072 are used in 0212. So the remainder won't map to Unicode.

It was possible to use jis02*-raw instead of EUC-JP, but that implementation was too slow because you have to invoke encode() chunk by chunk. In fact I tried it and it was 3 times as slow.

And the sense of "what remains" gets moot when it comes to ISO-2022. Suppose you got a string like this:

  abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
                      ^^error occurs here.

What's the remaining stream?

  ghijklmnESC-to-asciiopqrstu

is WRONG, because we are now in a jis0208 chunk and the escape sequence is already stripped. Do we have to go like

  ESC-to-jis0208ghijklmnESC-to-asciiopqrstu

? But that slows down the encoder too much.

I just woke up. Let me think about this a little bit more.

Dan the Encode Maintainer
Re: iso-2022-jp problem
On Tuesday, April 16, 2002, at 12:00 , Nick Ing-Simmons wrote:

  abcdESC-to-jis0208cdefghijklmnESC-to-asciiopqrstu
                      ^^error occurs here.

What's the remaining stream? ghijklmnESC-to-asciiopqrstu

Does not matter for that case. does not map is a fatal error with $chk true (and would have become a replacement char if $chk was false). What matters is being able to tell the complete case, from partial case.

A. When you have converted whole thing set remains to ''.
B. When you have a partial encoding consume as much as you can and leave string with what is partial. e.g.

  abcdESC-to-jis0208cdefghijklmnESC-to -asciiopqrstu
                                      ^- buffer boundary

Then you return translation of abcdESC-to-jis0208cdefghijklmn and set remains to Esc-to so that :encoding can append -asciiopqrstu

One of the many reasons that programmers dislike 7bit ISO-2022 is exactly how to handle case B -- how to split the buffer in the middle. When handling 7bit ISO-2022, YOU ARE NOT SUPPOSED TO SPLIT THE BUFFER BY LENGTH.

Of course that causes a problem for large files and, even worse, network streams. But fortunately, 7bit ISO-2022 has one safety net for that: IT ALWAYS REVERTS TO ASCII BEFORE CONTROL CHARACTERS, including CRLF. So if you need to, you can safely split the buffer line by line. A script

  use Encode;
  binmode(STDOUT, ":utf8");
  while (<>) {
      print Encode::decode("iso-2022-jp", $_);
  }

is completely safe because $_ is guaranteed to start in ASCII and end in ASCII. Check RFC 1468 (http://www.ietf.org/rfc/rfc1468.txt) and others. It is not as complicated as it sounds.

If you cannot do that then don't return or consume anything so :encoding can keep appending till you have whole file but that is going to be very memory hungry.

As I said, if you are worried about memory, just using a line buffer is the answer.

Other encodings are subject to this boundary problem -- and solution. Arabic and Hebrew (BIDI boundary), Thai (word boundary), Hangul (for decomposed form), you name it.
But very fortunately, the legacy encodings for all of these are designed so that you can rely on CRLF to split the stream.

Dan the Encode Maintainer.
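The line-by-line approach argued for above can be demonstrated end to end. The sketch below is an illustration only: the sample text and variable names are invented, and it uses an in-memory filehandle in place of a real file or socket. It encodes two lines to iso-2022-jp, then decodes them one line at a time and checks that the pieces add up to the original.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two lines mixing ASCII with JIS X 0208 characters (sample data).
my $text = "abc \x{65E5}\x{672C}\x{8A9E}\n"
         . "\x{30AB}\x{30BF}\x{30AB}\x{30CA} xyz\n";
my $octets = encode('iso-2022-jp', $text);

# Treat the encoded octets as a file and read it line by line.
open my $fh, '<:raw', \$octets
    or die "cannot open in-memory handle: $!";

my $decoded = '';
while (my $line = <$fh>) {
    # Each line starts and ends in ASCII, so it can be decoded on its own.
    $decoded .= decode('iso-2022-jp', $line);
}
close $fh;

print $decoded eq $text ? "round trip ok\n" : "round trip mismatch\n";
```

Splitting on newline is safe here because the double-byte octets of ISO-2022-JP stay within the printable 7-bit range and never collide with a line terminator.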
Re: iso-2022-jp problem
On Tuesday, April 16, 2002, at 01:06 , Nick Ing-Simmons wrote: So we need some way of telling from an encoding object (e.g. an attribute or a method call) that it needs line buffering so that :encoding layer can take the appropriate steps. Okay, which way do you like, attribute or method ? I think method is more elegant but attribute seems easier to fetch. Since this is more for PerlIO than Encode itself, I would appreciate if you gave me the API (just name would be enough) and I will add them to ISO-2022 stuff (not just JP but KR has one, too). Dan
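For readers following this thread today: later Encode releases went the method route. As far as I know the method is called needs_lines and is available on every encoding object; treat that name as an assumption to verify against the Encode::Encoding documentation of your installed version. A short sketch of how a caller might query it:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Ask each encoding object whether it wants to be fed whole lines.
# needs_lines is assumed to exist in the installed Encode (see lead-in).
my %needs_lines;
for my $name (qw(iso-2022-jp euc-jp)) {
    my $enc = find_encoding($name)
        or die "unknown encoding: $name";
    $needs_lines{$name} = $enc->needs_lines ? 1 : 0;
}

printf "%-12s needs line buffering: %s\n",
    $_, $needs_lines{$_} ? 'yes' : 'no'
    for sort keys %needs_lines;
```

A method keeps the decision inside the encoding class, so an escape-based encoding can opt in by overriding it without the I/O layer having to know a list of names.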
Re: README.jp (or README.jp?)
On Tuesday, April 16, 2002, at 08:14 , Jarkko Hietaniemi wrote: Could I ask for the Japanese translation? (Check out Autrijus' latest message about the subject, they had a useful additional section.) Sorry. I was too preoccupied w/ the module itself. Will be submitted before I go to bed. Dan
[Encode] 1.40 will be released in a few hours!
Folks,

I will release ver. 1.40 of Encode after the smoke tests are done. With In-XSimmons' XS version of Unicode transcoders, encoding.pm enhancements and fixes (that led to the discovery of the "child gets croaked before born" bug), and other nits picked, a simple version increment is not enough.

* With all modules loaded, it can transcode some 113 encodings, and it is easy to add more via enc2xs.
* With the encoding pragma, you can emulate Jperl and more.
* Though Encode accounts for some 30% of PERL5LIB in size, its memory consumption is not that big. Here is a list of core file sizes via dump immediately after modules loaded on my FreeBSD box.

    perl alone        774,144 bytes
    No Encode::XX   1,171,456
    With All        2,990,080
    All+HanExtra    3,534,848

* I decided not to include Indics. It is MY obsession to include all encodings that are available in unicode.org, but come to think of it, HanExtra is already 'external', and for other encodings there are always others for whom they are an 'obsession'. So I decided to wait till my obsession becomes 'ours'. And I already added the '-C' option to enc2xs so postinstalled modules can also join the demand-loading list. Better to take time to let it mature enough for production quality.

Detailed Changes right after my signature.

Dan the Encode Maintainer

1.40
+ Encode/ConfigLocal_PM.e2x
! lib/Encode/Config.pm
! bin/enc2xs
  enc2xs -C now generates/updates Encode::ConfigLocal. ConfigLocal_PM.e2x is a skeleton thereof.
! lib/Encode/Config.pm
! CN/CN.pm
  use Encode::CN::HZ; was missing.
! t/Unicode.t
! t/unibench.t
  More rigorous tests added to test XS, especially on memory allocation.
! Encode.xs
! lib/Encode/Unicode.pm
  NI-S implemented an XS version -- merged
  Message-Id: [EMAIL PROTECTED]
! encoding.pm
! t/jperl.t
  Source filter option added. With this option on, you can write perl 5.8-savvy scripts (such as UTF-8 identifiers) in legacy encodings. t/jperl.t enhanced to test this feature.
! t/Unicode.t
  ok() gotcha addressed by Benjamin fixed.
  Though I didn't exactly apply his suggestion, this degree of nitting is enough to add him to AUTHORS list.
  Message-Id: [EMAIL PROTECTED]
! JP/JP.pm
+ lib/Encode/JP/JIS7.pm
- lib/Encode/JP/JIS.pm
- lib/Encode/JP/2022_JP.pm
- lib/Encode/JP/2022_JP1.pm
  7bit-jis, iso-2022-jp and iso-2022-jp1 are all aggregated to JIS7.pm for better maintainability and performance
! encoding.pm
  Added caveat for non-ascii identifiers.
! encoding.pm
  fixes by jhi, the original author of this pragmatic module.
  Message-Id: [EMAIL PROTECTED]