On Tue, Mar 26, 2002 at 09:07:25AM +0900, Dan Kogai wrote:
> Encode hackers (Especially Autrijus)
>
>   I am now fairly content with the feature set of Encode so I decided to
> write some programs based upon it.
>   And I have found that most Chinese (Continental; seems like
> Taiwanese are much more technically correct) and Korean mails and web
> pages confuse "charset" and "encodings".  That is, charset="gb2312"
> really means euc-cn and charset="ks_c_5601-1987" really means euc-kr.
> Sadly this misconception is embedded in popular browsers.
>   So when you try something like
>
> my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
> ....
> my $utf8 = encode($encname, $string);
>
> You are in big trouble.  Aliases are no salvation because most web
> pages in *.cn happily include
>
> <META http-equiv="Content-Type" content="text/html; charset=gb2312">
>
>   It seems to them it is taken for granted that an encoding is simply a
> charset encoded in EUC.  Anton has wistfully stated this in
> Encode::Supported but I didn't realize the depth of the problem until I
> put Encode from in vitro to in vivo (that is, out of the lab and into the
> real world).
>   So I propose to;
>
> * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw

-raw sounds funny, as if it were somehow an "unprocessed" version.
How about -strict?

> * and alias gb2312 and ksc5601 to euc-(cn|kr)
>
> I know it's technically wrong but perl opts more for practical than
> technical....
>
> Dan the Man with Too Many SPAMs from CN and KR

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
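
For illustration only: until such aliases exist, the mapping discussed in this thread could be approximated in user code by translating the charset label to the EUC encoding before calling Encode. This is a minimal sketch under stated assumptions. The table %CHARSET_FIX and the helper effective_encoding() are hypothetical names, not part of Encode, and only the two labels mentioned above are covered. It also uses decode() rather than encode(), on the assumption that the goal is to turn incoming octets into a Perl string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical table: charset labels as they appear in the wild, mapped to
# the encoding the sender most likely used (the two cases from this thread).
my %CHARSET_FIX = (
    'gb2312'         => 'euc-cn',
    'ks_c_5601-1987' => 'euc-kr',
);

# Hypothetical helper: extract the charset label from a Content-Type header
# line and return the encoding name to hand to Encode.
# Returns undef when the header carries no charset parameter.
sub effective_encoding {
    my ($header) = @_;
    my ($label) = $header =~ /^Content-Type:.*charset=["']?([0-9A-Za-z_-]+)/i
        or return;
    $label = lc $label;
    return exists $CHARSET_FIX{$label} ? $CHARSET_FIX{$label} : $label;
}

# Usage: the header says gb2312, but the octets are decoded as euc-cn.
my $enc    = effective_encoding('Content-Type: text/html; charset=gb2312');
my $octets = "\xD6\xD0\xCE\xC4";    # two characters in EUC-CN
my $text   = decode($enc, $octets); # Perl string, two characters long
```

Keeping the table outside Encode leaves the strict meaning of the names untouched, at the cost of every caller having to apply the mapping itself.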