On December 3, 2002 at 13:16, "Takashi P.KATOH" wrote:

> From: Earl Hood <[EMAIL PROTECTED]>
> Subject: RFC: Japanese Text Conversion and other language issues
> Date: Sat, 30 Nov 2002 22:23:13 -0600
> > Should MHonArc::CharEnt replace iso2022.pl as the default
> > CHARSETCONVERTER for iso-2022-jp text?
>
> I prefer not to replace it.
> The reasons are:
>
> (1) I think Namazu cannot treat Unicode character entity
>     references as it is, so changing the default might
>     confuse MHonArc+Namazu users.
>     (In fact, this statement is not accurate. I'll describe
>     more details later in this mail).

Thanks for looking into it. I think Unicode character entity
references are important for all languages. For example, it is
quite common to use Unicode and/or numeric character entity
references for Latin-based languages. HTML 4.0 only defines a
small set of named entities.

> (2) Human unreadable (i.e., poor maintainability)
>     Imagine if `Hello' written as
>     `&#72;&#101;&#108;&#108;&#111;'.
>     You might say `The files generated by MHonArc don't need
>     to be viewed except via web browsers'.
>     Nevertheless, it is also true that sometimes I needed to
>     see them for maintenance.

I understand the need to view the raw HTML, but I think this
affects only a select few, mainly admin/tech types. Your comment
would also apply if all data were in UTF-8 (unless, of course, you
have access to a UTF-8-aware editor/viewer). ASCII text will be
left as-is, but you are correct that Japanese characters will all
be represented as &#HHHH;, making it hard to read the raw data.

> (3) Some softwares cannot read it.
>     This is also concerning maintainability.

Yep, but it may be a hit that needs to be taken in order to solve
charset soup. BTW, can you provide some real-world example
software (besides Namazu)?

> > MHonArc::CharEnt tries to map everything to HTML entity references,
> > allowing for the ability of multiple languages to co-exist.
>
> Yes, this is a great (and admirable) advantage.
> But, fortunately or unfortunately, we have few multiple
> languages co-existing messages.

I agree that for many locales, archives tend to contain messages
of that locale. However, I'm also trying to consider users that
run large archives of multiple lists comprised of multiple
languages.

> I recognized that another advantage to use entity
> references: We can use Kanji characters in rc file.
> For example, we might want to write `Next' in Japanese like
> this:
>
>   <NextButton chop>
>   [<a href="$MSG(NEXT)$">ESC-$-B < ! ESC-(-B</a> ($MSG(NEXT)$)]
>   </NextButton>
>
> but this does not work (second resource variable won't be
> expanded) because `$' is included in Kanji.
> (This example is somewhat contrived because I needed a
> resource variable AFTER Kanji.)

Have you tried using the VARREGEX resource to minimize rc file
conflicts?

> I've not checked yet, but I think we can use Kanji
> Characters in rc file if we use MHonArc::CharEnt.
> (I don't know if we need to write it as entity references,
> though.)

MHonArc::CharEnt and rc files are independent. I.e., MHonArc::CharEnt
knows nothing about processing rc files and vice-versa. Hence,
you could use character entity references in your rc files and
still use iso2022jp.pl for converting message text.

> Finally, I should tell you that these are my personal
> opinion, and I don't know what other Japanese users think.

All opinions count, and I appreciate your response. I have no
real problem leaving iso2022jp.pl as the default. I'll just add a
note to the docs, and to the release notes, that MHonArc::CharEnt
can be used if desired.

> I'm planning to write Earl's RFC in my web page (in
> Japanese) to ask for other users' opinion.

Thanks. Make sure to note that iso2022jp.pl is NOT going away.
Hence, users can specify it explicitly if they want to make sure
it is used.

On a related note, I'm going to see about adding a new resource
called MSGTXTENCODE that would give users the ability to
pre-convert all message text entities to the specified encoding
(an idea suggested by Moofie). The resource will only work if
the Unicode::MapUTF8 or Encode module is installed, which means
only those using Perl >= 5.6 will be able to use the feature.

The reason for the resource is that other text/* types can
introduce foreign character encodings (e.g. text/html).
Therefore, a user would still not be able to completely control
the character encoding of archive pages (unless they only allow
text/plain messages).

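To make the two approaches concrete, here is a rough sketch in
Python. It is not MHonArc code (MHonArc is Perl): Python's built-in
codecs merely stand in for a converter like MHonArc::CharEnt and for
what a resource like MSGTXTENCODE might do, and the helper names
to_char_refs and reencode are made up for illustration. It also
assumes the Kanji in the rc file example is U+6B21.

```python
# Illustrative only: Python's codecs stand in for MHonArc's Perl converters.
# The helper names below are hypothetical and are not MHonArc APIs.


def to_char_refs(raw: bytes, charset: str) -> str:
    """Decode raw bytes, then replace every non-ASCII character with a
    numeric character reference. ASCII text is left as-is."""
    text = raw.decode(charset)
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def reencode(raw: bytes, src: str, dst: str) -> bytes:
    """Convert raw bytes from one charset to another (e.g. to UTF-8)."""
    return raw.decode(src).encode(dst)


# Assumption: U+6B21 is the Kanji used for `Next' in the rc file example.
NEXT_JIS = "\u6b21".encode("iso2022_jp")

if __name__ == "__main__":
    # The raw iso-2022-jp bytes contain "$" (from the ESC $ B escape),
    # which is what collides with resource variable syntax in rc files.
    print(b"$" in NEXT_JIS)
    print(to_char_refs(NEXT_JIS, "iso2022_jp"))   # &#27425;
    print(to_char_refs(b"Hello", "iso2022_jp"))   # Hello
    print(reencode(NEXT_JIS, "iso2022_jp", "utf-8"))
```

The character reference form is pure ASCII, so it contains no `$'
and can sit next to text from any other language, at the cost of
being unreadable in a plain editor. The re-encoded form stays
readable in a UTF-8-aware editor.
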
> However, please wait for a few days because I have a bad
> cold now...

Hope you get better,

--ewh

---------------------------------------------------------------------
To sign-off this list, send email to [EMAIL PROTECTED] with the
message text UNSUBSCRIBE MHONARC-DEV
