Sei Heng Ang <[EMAIL PROTECTED]> writes:
>2. I am currently writing a mail filter to extract
>message body from the email. However, even though the
>email charset has been defined as (eg) gb2312, it
>still contain standard ASCII characters. Is there a
>way where I can sort of convert the entire string into
>unicode? Is there a module (library) that can
>automatically recognize the individual characters in
>the string and convert them accordingly?

Good luck! E-mail charset specifications are very patchy. Many mail clients "lie" about the content. Microsoft clients in particular use the names of standard encodings when the mail contains a different encoding, e.g. they widely claim iso-8859-1 when they mean the Microsoft code page, which is closely related but assigns characters to 0x80..0x9F where ISO does not.

The Encode::CN module has this to say about gb2312:

"
When you see C<charset=gb2312> on mails and web pages, they really mean
C<euc-cn> encodings. To fix that, C<gb2312> is aliased to C<euc-cn>.
Use C<gb2312-raw> when you really mean it.

The ASCII region (0x00-0x7f) is preserved for all encodings, even though
this conflicts with mappings by the Unicode Consortium.
"

I have had some success displaying email using Encode's euc-cn and Unicode fonts, but as I can't read many Chinese characters this was mainly just as an exercise.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
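
To make the two points above concrete, here is a minimal Perl sketch, assuming the Encode module that ships with Perl 5.8 and later. The sample byte strings and variable names are made up for illustration and do not come from any real mail. It is not a full mail filter, only the decoding step, and it assumes the charset label (or your guess about what the client really meant) is already known.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical message body: plain ASCII mixed with two Chinese characters
# stored as EUC-CN bytes (0xD6D0 0xCEC4), which is what mail labelled
# charset=gb2312 normally contains.
my $gb_bytes = "Subject: \xD6\xD0\xCE\xC4 test";

# Encode aliases 'gb2312' to 'euc-cn'. ASCII bytes pass through unchanged,
# so the whole string can be decoded in one call.
my $gb_text = decode('gb2312', $gb_bytes);

# Hypothetical bytes from a client that says iso-8859-1 but sends cp1252:
# 0x93 and 0x94 are curly quotes in cp1252, C1 control codes in ISO-8859-1.
my $ms_bytes  = "\x93quoted\x94";
my $as_latin1 = decode('iso-8859-1', $ms_bytes);
my $as_cp1252 = decode('cp1252',     $ms_bytes);

# Print the decoded text and the first code point under each interpretation.
binmode STDOUT, ':encoding(UTF-8)';
print "$gb_text\n";
printf "iso-8859-1: U+%04X, cp1252: U+%04X\n",
    ord($as_latin1), ord($as_cp1252);
```

The mixed ASCII and Chinese string needs no per-character recognition: one decode call yields a Perl Unicode string. For the Microsoft case, the same byte 0x93 comes out as the control code U+0093 under iso-8859-1 and as the left curly quote U+201C under cp1252, so treating a claimed iso-8859-1 as cp1252 is a heuristic you may want, not something Encode does for you.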