Re: [Encode] euc-jp vs euc-jisx0213
On Monday, April 29, 2002, at 07:38 , SADAHIRO Tomoyuki wrote:
> I doubt whether users of 'euc-jp' will assume it to be a combination
> with JIS X 0213.

They don't have to, because 'euc-jp' behaves exactly the same as before so long as the charset is in ASCII/JISX(0201|0208|0212).

> Such a mixing would prevent warning/croaking for appearance of code
> points that are not defined originally (meaning w/o X 0213), wouldn't it?

That was my biggest concern, but I have decided to go ahead and let euc-jp (partially) support JIS X 0213, and the reason is simple: Encode::JP is already too big to differentiate between the various flavors of euc-jp. In such cases, we should settle for the most 'comprehensive' version.

Even the term 'euc-jp' is too ambiguous for many. At first it didn't include G3, and some say the variants must be clearly marked as something like 'euc-jp-classic' (no 0212 support) vs. 'euc-jp-modern' and so forth (then our current euc-jp should be marked as 'euc-jp-postmodern' :). It would be nice if we could go that way, like 7bit-JIS/ISO-2022-JP/ISO-2022-JP-1, but for euc-jp we would have to have a whole ucm for each.

This is definitely a todo for Perl 5.8.1 and up, and I have already come up with a solution: the future Encode (Encode II) will support a CES generator; that is, you can express euc-jp not as one whole big table but as a combination of tables. That will also reduce the duplicates found in vendor mappings. It will be a complete rewrite of encengine.c. But that requires not only code but also an expansion of the UCM format, so give me more time (and Perl 5.8.0!).

Dan the Encode Maintainer
[Encode] Encode-JIS2K-0.01 uploaded to CPAN
Folks,

I gotta go in 5 minutes so I'll just dump the README file after the sig.

Dan the Encode Maintainer

NAME
    Encode::JIS2K - JIS X 0213 (aka JIS 2000) Encodings

INSTALLATION
    To install this module type the following:

        perl Makefile.PL
        make
        make test
        make install

SYNOPSIS
        use Encode::JIS2K;
        use Encode qw/encode decode/;
        $euc_2k = encode("euc-jisx0213", $utf8);
        $utf8   = decode("euc-jisx0213", $euc_jp);

ABSTRACT
    This module implements encodings that cover the JIS X 0213 charset (AKA JIS 2000, hence the module name). Encodings supported are as follows.

      Canonical        Alias                               Description
      ----------------------------------------------------------------------
      euc-jisx0213     qr/\beuc.*jp[ \-]?(?:2000|2k)$/i    EUC-JISX0213
                       qr/\bjp.*euc[ \-]?(2000|2k)$/i
                       qr/\bujis[ \-]?(?:2000|2k)$/i
      shiftjisx0213    qr/\bshift.*jis(?:2000|2k)$/i       Shift_JISX0213
                       qr/\bsjis[ \-]?(?:2000|2k)$/i
      iso-2022-jp-3
      jis0213-1-raw                                        JIS X 0213 plane 1, raw format
      jis0213-2-raw                                        JIS X 0213 plane 2, raw format

DESCRIPTION
    To find out how to use this module in detail, see the Encode manpage.

  What is JIS X 0213 anyway?
    Simply put, JIS X 0213 is a rework and reorganization of JIS X 0208 and JIS X 0212. It consists of two 94x94 planes which roughly correspond as follows:

      JIS X 0213 Plane 1 = JIS X 0208 + extension
      JIS X 0213 Plane 2 = JIS X 0212 reorganized + extension

    And here is the character repertoire thereof at a glance.

                           # of codepoints   Kuten Ku (rows) used
      ----------------------------------------------------------------
      JIS X 0208                 6,879       1..8,16..83
      JIS X 0213-1               8,762       1..94 (all!)
      JIS X 0212                 6,067       2,6..7,9..11,16..77
      JIS X 0213-2               2,436       1,3..5,8,12..15,78..94
      ----------------------------------------------------------------
      (JIS X0213 Total)         11,197

    JIS X 0213 was designed to extend JIS X 0208 and JIS X 0212 without being incompatible with (classic) EUC-JP and Shift_JIS. The following characteristics are a result thereof.

    o   JIS X 0213 plane 1 is (almost) a superset of JIS X 0208. However, with Unicode 3.2.0 the mappings differ in 3 codepoints.
          Kuten     JIS X 0208 -> Unicode           JIS X 0213 -> Unicode
          ------------------------------------------------------------------
          1-1-17    U+FFE3  # FULLWIDTH MACRON      U+203E  # OVERLINE
          1-1-29    U+2014  # EM DASH               U+2015  # HORIZONTAL BAR
          1-1-79    U+FFE5  # FULLWIDTH YEN SIGN    U+00A5  # YEN SIGN

    o   By the same token, JIS X 0213 plane 2 contains JIS Dai-4 Suijun Kanji (JIS Kanji Repertoire Level 4). This allows EUC-JP's G3 to contain both JIS X 0212 and JIS X 0213 plane 2. However, JIS X 0212:1990 already contains many of the Dai-4 Suijun Kanji, so EUC's G3 is subject to containing duplicate mappings.

    o   Because of Halfwidth Katakana, Shift_JIS mapping has always been tricky, and with JIS X 0213 it is even trickier. Here is a regex that matches a Shift_JISX0213 sequence (note: you have to "use bytes" to make it work!)

          $re_valid_shiftjisx0213 =
            qr/^(?:
                [\x00-\x7f] |                              # ASCII or
                [\xa1-\xdf] |                              # JIS X 0201 KANA or
                [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]   # JIS X 0213
            )+$/xo;

  Note on EUC-JISX0213 (vs. EUC-JP)
    As of Encode-1.64, 'euc-jp' does support euc-jisx0213 for decoding. However, 'euc-jp' in Encode and 'euc-jisx0213' differ as follows:

                          euc-jp                     euc-jisx0213
      ----------------------------------------------------------------------
      Decodes             (0201-K|0208|0212|0213)    ditto
      Round-Trip (|0)     (0201-K|0208|0212)         JIS X (0201-K|0213)
      Decode Only (|3)    those only found in 0213   those only found in 0212
      ----------------------------------------------------------------------

AUTHORS
    Dan Kogai [EMAIL PROTECTED]

COPYRIGHT
    Copyright 2002 by Dan Kogai [EMAIL PROTECTED].

    This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

    See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO
    the Encode manpage, the Encode::JP
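Editorial sketch: the byte-level check described in the README can be tried without installing Encode::JIS2K, since it is only a regex over octets. The following is a minimal sketch under that assumption; the helper name looks_like_shiftjisx0213 is hypothetical and not part of any module. It only tests byte ranges, so it says nothing about whether each double-byte pair is actually assigned in JIS X 0213.

```perl
use strict;
use warnings;

# Same three alternatives as the README regex: ASCII, JIS X 0201 kana,
# or a lead byte followed by a trail byte.
my $re_valid_shiftjisx0213 = qr/^(?:
      [\x00-\x7f]                                # ASCII
    | [\xa1-\xdf]                                # JIS X 0201 kana
    | [\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]   # double-byte sequence
)+$/x;

# Hypothetical helper: 1 if the octet string passes the range check, else 0.
# The argument must be a byte string, not decoded text.
sub looks_like_shiftjisx0213 {
    my ($octets) = @_;
    return $octets =~ $re_valid_shiftjisx0213 ? 1 : 0;
}

my @samples = (
    "abc",         # plain ASCII
    "\x82\xa0",    # lead byte + trail byte
    "\xb1",        # single-byte halfwidth katakana
    "\x82",        # lead byte with no trail byte
    "\x82\x7f",    # 0x7f is not a valid trail byte
);

for my $octets (@samples) {
    printf "%-8s %s\n", unpack('H*', $octets),
        looks_like_shiftjisx0213($octets) ? 'ok' : 'rejected';
}
```

Because the check is purely range-based, it is best used as a quick filter before handing the data to a real decoder.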
Encode doesn't like undef
This is with Encode 1.64:

    $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
    Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

I don't know Encode well enough to check whether there are any other places this will strike.

Paul
Apache::GuessCharset - Encode::Guess + I18N::Charset
Hi,

I've made a tiny PerlFixupHandler that guesses files' encodings and automatically adds a charset attribute to Content-Type according to the guess, using Encode::Guess and I18N::Charset. This module can be a powerful replacement/supplement for Apache's Add*Charset stuff.

The module can be downloaded from
http://bulknews.net/lib/archives/Apache-GuessCharset-0.01.tar.gz

Well, this module uses I18N::Charset internally to translate encoding names into IANA registered names (and its own table to convert those to the preferred MIME name again). How about porting this module into Encode, which might be useful?

Thanks,
--
Tatsuhiko Miyagawa [EMAIL PROTECTED]

NAME
    Apache::GuessCharset - adds HTTP charset by guessing file's encoding

SYNOPSIS
      PerlModule Apache::GuessCharset
      SetHandler perl-script
      PerlFixupHandler Apache::GuessCharset

      # how many bytes to read for guessing (default 512)
      PerlSetVar GuessCharsetBufferSize 1024

      # list of encoding suspects
      PerlSetVar GuessCharsetSuspects euc-jp
      PerlAddVar GuessCharsetSuspects shiftjis
      PerlAddVar GuessCharsetSuspects 7bit-jis

DESCRIPTION
    Apache::GuessCharset is an Apache handler which adds the HTTP charset attribute by automatically guessing files' encodings via Encode::Guess.

CONFIGURATION
    This module uses the following configuration variables.

    GuessCharsetSuspects
        a list of encodings for Encode::Guess to check. See the Encode::Guess manpage for details.

    GuessCharsetBufferSize
        specifies how many bytes this module reads from the source file in order to guess the encoding properly. The default is 512.

AUTHOR
    Tatsuhiko Miyagawa [EMAIL PROTECTED]

    This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO
    the Encode::Guess manpage, the Apache::File manpage
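Editorial sketch: the guessing step that the handler delegates to Encode::Guess can be tried on its own, outside Apache. This is a minimal illustration, assuming only the documented guess_encoding() function of Encode::Guess; the helper name guess_charset is hypothetical, and this is not the code of Apache::GuessCharset itself. Mapping the result to an IANA/MIME name (the I18N::Charset part) is left out.

```perl
use strict;
use warnings;
use Encode::Guess;    # imports guess_encoding()

# Hypothetical helper: return the Encode name of the guessed encoding,
# or undef when the guess fails or is ambiguous. guess_encoding()
# returns an encoding object on success and an error string otherwise.
sub guess_charset {
    my ($octets, @suspects) = @_;
    my $decoder = guess_encoding($octets, @suspects);
    return ref($decoder) ? $decoder->name : undef;
}

# EUC-JP octets for U+65E5 U+672C U+8A9E, standing in for the first
# GuessCharsetBufferSize bytes read from a file.
my $buffer  = "\xC6\xFC\xCB\xDC\xB8\xEC";
my $charset = guess_charset($buffer, qw/euc-jp shiftjis 7bit-jis/);

print defined $charset
    ? "Content-Type: text/html; charset=$charset\n"
    : "Content-Type: text/html\n";
```

Returning undef on an ambiguous guess lets the caller fall back to sending no charset at all rather than a wrong one.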
Re: Apache::GuessCharset - Encode::Guess + I18N::Charset
At Tue, 30 Apr 2002 09:45:30 -0400, Thurn, Martin (Intranet) [EMAIL PROTECTED] wrote:
> > preferred MIME again). How about porting this module into Encode,
> > which might be useful?
>   ^^^
> What exactly do you have in mind? Add the charset names used by Encode?
> That would be great... if I ever get time to play with 5.8.0 etc...?

Something like that. My first thought was "how do I get the IANA registered name from the current encoding via Encode?". Then I searched CPAN and came upon your I18N::Charset.

If I18N::Charset works seamlessly with Encode, it'll be nice. But this would be something for 5.8.1.

--
Tatsuhiko Miyagawa [EMAIL PROTECTED]
Re: Encode doesn't like undef
On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote:
> This is with Encode 1.64
>
>     $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
>     Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
>
> I don't know Encode well enough to check if there are any other places
> this will strike.

I think we'd better leave it that way; it needs a PV to (en|de)code, so consider this a feature. Of course

    perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()'

is perfectly safe and legal.

Dan the Encode Maintainer
RE: Encode doesn't like undef
From: Dan Kogai [mailto:[EMAIL PROTECTED]]
> On Tuesday, April 30, 2002, at 07:14 , Paul Marquess wrote:
> > This is with Encode 1.64
> >
> >     $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8(undef)'
> >     Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
> >
> > I don't know Encode well enough to check if there are any other places
> > this will strike.
>
> I think we'd better leave that way; It needs a PV to (en|de)code so
> consider this a feature.

I agree that passing undef() to one of the encoding functions may be an edge condition too far, but passing a variable that contains undef is more common.

    $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
    Name main::a used only once: possible typo at -e line 1.
    Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.

Can this be detected and silenced?

> Of course
>     perl5.7.3 -w -MEncode -e 'Encode::encode_utf8()'
> is perfectly safe and legal.

Paul
Re: Encode doesn't like undef
On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote:
> I agree that passing undef() to one of the encoding functions may be an
> edge condition too far, but passing a variable that contains undef is
> more common.
>
>     $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
>     Name main::a used only once: possible typo at -e line 1.
>     Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
>
> Can this be detected and silenced?

You've got a point. A warning should warn when and only when there is a danger therein, and passing undef itself is harmless. And this can be done easily by adding

    defined $str or return;

to each sub concerned. Okay, I'll go for that.

Dan the Encode Maintainer
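Editorial sketch: the one-line guard proposed above can be illustrated with a caller-side wrapper. This is only a sketch of the idea under discussion; quiet_encode_utf8 is a hypothetical name, not a function of Encode, and whether Encode itself should behave this way is exactly what the rest of the thread debates.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper, not part of Encode: bail out on undef before
# the value ever reaches Encode::encode_utf8.
sub quiet_encode_utf8 {
    my ($str) = @_;
    defined $str or return;    # the guard proposed in this thread
    return Encode::encode_utf8($str);
}

# Collect warnings so we can see whether anything was emitted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $nothing = quiet_encode_utf8(undef);        # guard returns early
my $octets  = quiet_encode_utf8("\x{3042}");   # HIRAGANA LETTER A

printf "undef in -> %s out, %d warning(s); U+3042 -> %s\n",
    defined $nothing ? 'defined' : 'undef',
    scalar @warnings,
    unpack('H*', $octets);
```

The trade-off is that a silently returned undef can hide a genuine bug in the caller, which is the objection raised later in the thread.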
2nd Last Call for the Character Model for the WWW
Owing to the large number of Last Call comments on the:

   Character Model for the World Wide Web
   W3C Working Draft 26 January 2001

and to the large number of changes these comments gave rise to, the W3C I18N WG has decided to issue a 2nd Last Call for comments:

   Character Model for the World Wide Web
   W3C Working Draft 30 April 2002
   http://www.w3.org/TR/2002/WD-charmod-20020430

The document's abstract says:

   This Architectural Specification provides authors of specifications, software developers, and content developers with a common reference for interoperable text manipulation on the World Wide Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions, building on the Universal Character Set, defined jointly by Unicode and ISO/IEC 10646. Some introductory material on characters and character encodings is also provided.

Please review the document and submit any comments by 31 May 2002. Comments should preferably be submitted via the Last Call Comment Form (http://www.w3.org/2002/05/charmod/LastCall). They may alternatively be submitted by email to [EMAIL PROTECTED]

To help us process your comments, please submit each comment separately. If doing so is too awkward, please number your comments clearly.

Many thanks,
Misha Wolf
W3C I18N WG Chair
RE: Encode doesn't like undef
From: Dan Kogai [mailto:[EMAIL PROTECTED]]
> On Tuesday, April 30, 2002, at 11:42 , Paul Marquess wrote:
> > I agree that passing undef() to one of the encoding functions may be an
> > edge condition too far, but passing a variable that contains undef is
> > more common.
> >
> >     $ perl5.7.3 -w -MEncode -e 'Encode::encode_utf8($a)'
> >     Name main::a used only once: possible typo at -e line 1.
> >     Use of uninitialized value in subroutine entry at /tmp/bleed/lib/perl5/5.7.3/sun4-solaris/Encode.pm line 183.
> >
> > Can this be detected and silenced?
>
> You've got a point. Warning should warn when and only when there is a
> danger therein and passing undef itself is harmless. And this can be
> done easily by adding
>
>     defined $str or return;
>
> for each sub concerned. Okay, I'll go for that.

Yep, I think that's the fix to go for.

Paul
RE: Encode doesn't like undef
Paul Marquess [EMAIL PROTECTED] writes:
> I agree that passing undef() to one of the encoding functions may be an
> edge condition too far, but passing a variable that contains undef is
> more common.

Paraphrase

    nick@bactrian 1078$ perl -w -e 'print undef'
    Use of uninitialized value in print at -e line 1.
    nick@bactrian 1079$ perl -w -e 'print $a'
    Use of uninitialized value in print at -e line 1.
    nick@bactrian 1080$ perl -w -e '$a =~ tr/A/a/'
    Name main::a used only once: possible typo at -e line 1.
    Use of uninitialized value in transliteration (tr///) at -e line 1.

I agree that passing undef() to print/tr functions may be an edge condition too far, but passing a variable that contains undef is more common. Can this be detected and silenced?

/Paraphrase

Yes it could, but we don't, for very good reasons.

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
Re: [Encode] 1.65 released
Dan Kogai [EMAIL PROTECTED] writes:
> $Revision: 1.65 $ $Date: 2002/04/30 16:13:37 $
> ! Encode.pm
>   encode(undef) no longer warns for "Use of uninitialized value in
>   subroutine entry". Suggested by Paul.

Can I get warnings + fallbacks yet?
Can we have a bit to enable warns on undef please ;-)

There are casting issues with perlqq etc. using (say) UVxf but only passing a U8 (e.g. s[slen]) and not a UV, which the format expects.

>   Message-Id: [EMAIL PROTECTED]
> ! lib/Encode/Supported.pod
>   Encode::MIME::Header and Encode::Guess mentioned
>   Updated for Encode::HanExtra 0.05 and Encode::JIS2K
> ! lib/Encode/Guess.pm
>   POD fix by Miyagawa-kun
>   Message-Id: [EMAIL PROTECTED]
>
> Dan the Encode Maintainer

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
Encode should stay undefphobia
On Wednesday, May 1, 2002, at 02:10 , Nick Ing-Simmons wrote:
> Dan Kogai [EMAIL PROTECTED] writes:
> Please don't. $a =~ tr/A/a/; gives a warning so should encode/decode.

How could I be so dumb as not to anticipate you saying that! (Blame it on the fever.)

Paul, I now think Nick's got more of a point than you do, so I will revert it in the next version. Maybe I will document this undef-phobia of the Encode subs in the POD.

Dan the Warned Man
[Encode] 1.66 Released
My fever was down at last when I released Encode-1.66, available as follows:

Whole:
  http://www.dan.co.jp/~dankogai/Encode-1.66.tar.gz
  or CPAN
Diff against current: 264 lines
  http://www.dan.co.jp/~dankogai/current-1.66.diff.gz

And Changes.

$Revision: 1.66 $ $Date: 2002/05/01 05:41:06 $
! Encode.xs t/fallback.t
  WARN_ON_ERR no longer assumes RETURN_ON_ERR so you can issue a warning while fallback is in effect. This even came with a welcome side-effect of cleaner code with less nesting! Thank you, NI-XS. t/fallback.t is also modified to test this. And of course, the variables corresponding to UV[Xx]f are appropriately cast. This should've concluded the NI-XS homework.
! Encode.pm
  encode(undef) does warn again! Repented upon suggestion by NI-XS.
  Document for unless vs. '' added
  Message-Id: [EMAIL PROTECTED]

As you see, this is a NI-XS homework issue. Now I have only djgpp left (I think; djgpp is just so slow in my environment).

Dan the Encode Maintainer
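Editorial sketch: the "warning while fallback is in effect" behavior mentioned in the Changes entry can be tried by OR-ing a fallback flag with WARN_ON_ERR. This is a minimal illustration, assuming the check constants Encode::PERLQQ, Encode::WARN_ON_ERR and Encode::LEAVE_SRC as documented in the Encode manpage; the sample string and variable names are made up for the example, and the exact warning text is not asserted here.

```perl
use strict;
use warnings;
use Encode ();

# Capture warnings instead of printing them to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# PERLQQ is the fallback (escape unmappable characters Perl-style),
# WARN_ON_ERR asks for a warning as well, and LEAVE_SRC keeps $text intact.
my $check = Encode::PERLQQ() | Encode::WARN_ON_ERR() | Encode::LEAVE_SRC();

my $text   = "caf\x{e9}";    # U+00E9 has no mapping in ASCII
my $octets = Encode::encode('ascii', $text, $check);

print "encoded: $octets\n";
print "warnings seen: ", scalar(@warnings), "\n";
```

Without WARN_ON_ERR in the mask the same call would apply the fallback silently, so the combination is useful when lossy output is acceptable but should still be logged.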