InLanguage properties? [Was Re: Encode-InCharset-0.01 Released]

Dan Kogai Fri, 03 May 2002 01:34:54 -0700

On Friday, May 3, 2002, at 04:33 , Roman Vasicek wrote:
>> On Friday, May 3, 2002, at 02:41 , Dan Kogai wrote:
>>
>> I have just released Encode-InCharset-0.01, available as
>>
>>  http://www.dan.co.jp/~dankogai/Encode-InCharset-0.01.tar.gz and CPAN.
>>
>> I have developed this module primarily to implement ISO-2022-JP-3 and 
>> ISO-2022-CN in the future.  To implement encode() for these, you have 
>> to know which character set a given character belongs to.  But this 
>> module can also be used to check whether a string can safely be encoded
>> (though fallback is much faster).
>>
> Great! Good work.
>
> I have one, maybe off-topic, question. Is there a module which provides
> the same functionality for languages? I mean something like IsGerman,
> IsCzech, etc.


   Be our guest ;)  To my knowledge there is none, but it won't be too 
hard to implement -- for Roman script languages.  You just start with 
the ISO-8859 variants and subtract the characters you don't need.
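
   A minimal sketch of that idea for Czech, using a Perl user-defined 
property.  It is an illustration under assumptions, not an existing 
module: the names InCzech and is_czech_text are made up here, and the 
"subtraction" is expressed as keeping only a hand-written list of Czech 
letters out of the ISO-8859-2 repertoire.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode);

# Hypothetical user-defined property (the name must start with In or Is).
# Everything is computed inside the sub because perl may call it as
# early as regex compile time.
sub InCzech {
    # Upper half of ISO-8859-2, decoded to Unicode characters.
    my $bytes  = join '', map { chr($_) } 0xA0 .. 0xFF;
    my %latin2 = map { $_ => 1 } split //, decode('iso-8859-2', $bytes);

    # Accented letters of the Czech alphabet (hand-written list).
    my $wanted = 'áčďéěíňóřšťúůýž';
    my @keep   = grep { $latin2{$_} } split //, $wanted . uc $wanted;

    # Return format: one hex code point, or a tab-separated range, per line.
    return join '', "0041\t005A\n", "0061\t007A\n",
        map { sprintf("%04X\n", ord $_) } @keep;
}

# True if the text uses only Czech letters, whitespace and punctuation.
sub is_czech_text {
    my ($text) = @_;
    return $text =~ /\A[\p{InCzech}\s[:punct:]]+\z/ ? 1 : 0;
}
```

   Such a property only says that a string is writable with the Czech 
alphabet, not that it is Czech; plain ASCII text passes as well.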

   I consider this to be one of the problems of Unicode (as of now).  
When you aggregate anything, the origin is usually lost.  It is just as 
you can't retrieve 1+1 back from 2 (it could be 0+2 or -1+3 or 
anything).
   To overcome this shortcoming Unicode does have character properties, 
and you can use them to find out which I<script> a character belongs 
to.  But unfortunately there is no such property for the character 
repertoire a character originally came from (so I made one 
(Encode-InCharset) because I needed it).  Neither is there one for 
languages.
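
   To illustrate what the script property does and does not tell you, 
here is a small sketch using charscript() from the core Unicode::UCD 
module.  The helper name script_of is made up for this example.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charscript);

# Return the Unicode script name of the first character of a string.
sub script_of {
    my ($char) = @_;
    return charscript(ord $char);
}

# A kanji used in Japanese and a simplified Chinese character both
# report the same script, so the script alone cannot tell you the
# source repertoire or the language.
my @samples = ('あ', '漢', '们', 'A');
my %script  = map { $_ => script_of($_) } @samples;
```

   Both Han characters come back with the same script name, which is 
exactly the information loss described above.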
   Maybe Encode-InCharset-0.01 can help implement InLanguage, especially 
for complex CJK cases.  Here is a crude (and possibly incorrect) 
definition of InNihongo:

$InNihongo = qr/(?=
                    # lookahead: must come from a Japanese charset
                    \p{InJISX0213_1} |
                    \p{InJISX0213_2} |
                    \p{InASCII}
                )
                (?:
                    # and must be in one of these scripts or blocks
                    \p{Hiragana} |
                    \p{Katakana} |
                    \p{Han}      |
                    \p{InBasicLatin}   # contemporary!
                )/x;

Notice that the pattern is prefixed with a lookahead for InJISX0213_1 
and InJISX0213_2.  Otherwise all Han ideographs that are not used in 
Japanese would also be considered Nihongo.
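
   If Encode-InCharset is not at hand, roughly the same test can be 
sketched with plain Encode by letting the fallback mechanism do the 
charset check.  This swaps in a different technique from the pattern 
above: the function names in_charset and is_nihongo are made up, and 
'shiftjis' (JIS X 0208) stands in for the JIS X 0213 planes as an 
assumption.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# True if every character of $str can be encoded in $charset.
sub in_charset {
    my ($charset, $str) = @_;
    my $copy = $str;    # encode() with a CHECK value may modify its argument
    return eval { encode($charset, $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# Crude check: right scripts AND representable in a Japanese charset.
sub is_nihongo {
    my ($str) = @_;
    return 0
        unless $str =~ /\A[\p{Hiragana}\p{Katakana}\p{Han}\p{InBasicLatin}]+\z/;
    return in_charset('shiftjis', $str);
}
```

   The script test alone would accept any Han ideograph; the charset 
test is what rejects ideographs that have no place in the Japanese 
repertoire.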


Dan the Encode Maintainer
