On Tue, Mar 26, 2002 at 09:07:25AM +0900, Dan Kogai wrote:
> Encode hackers (especially Autrijus)
> 
>    I am now fairly content with the feature set of Encode so I decided to 
> write some programs based upon it.
>    And I have found that most Chinese (Continental; the Taiwanese seem 
> to be much more technically correct) and Korean mails and web 
> pages confuse "charset" and "encoding".  That is, charset="gb2312" 
> really means euc-cn and charset="ks_c_5601-1987" really means euc-kr.  
> Sadly this misconception is embedded in popular browsers.
>    So when you try something like
> 
>    my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
>    ....
>    my $utf8 = decode($encname, $string);
> 
>    You are in big trouble.  Aliases are no salvation because most web 
> pages in *.cn happily include
> 
>    <META http-equiv="Content-Type" content="text/html; charset=gb2312">
> 
>    They seem to take it for granted that an encoding is simply a 
> charset encoded in EUC.  Anton wistfully states this in 
> Encode::Supported but I didn't realize the depth of the problem until I 
> took Encode from in vitro to in vivo (that is, out of the lab and into 
> the real world).
>    So I propose to:
> 
> * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw

-raw sounds funny, as if it were somehow an "unprocessed" version.
How about -strict?

> * and alias gb2312 and ksc5601 to euc-(cn|kr)
> 
>    I know it's technically wrong but Perl opts for the practical over 
> the technical....
> 
> Dan the Man with Too Many SPAMs from CN and KR
> 
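
For illustration, the proposed aliasing might look roughly like the
sketch below.  It assumes define_alias from Encode::Alias and the
euc-cn and euc-kr encodings that ship with Encode; the sample header
and the sample bytes are made up for the example, and none of this is
meant as the actual patch.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Encode::Alias qw(define_alias);

# Assumption: map the commonly mislabelled charset names onto the
# EUC encodings that the data is really in.
define_alias( 'gb2312'         => 'euc-cn' );
define_alias( 'ks_c_5601-1987' => 'euc-kr' );

# Hypothetical header, as found on many *.cn pages.
my $header = 'Content-Type: text/html; charset=gb2312';
my ($encname) = $header =~ /charset=["']?([0-9A-Za-z_-]+)/;

# Hypothetical payload: two Chinese characters as EUC-CN bytes.
my $bytes = "\xD6\xD0\xCE\xC4";

# With the alias in place, the mislabelled name resolves to euc-cn
# and the bytes decode to Perl characters.
my $text = decode( $encname, $bytes );
```

Whatever the strict table ends up being called (-raw or -strict), it
would stay reachable under that explicit name.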

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
