Re: Suggestions for improving MHA's i18n support

Mooffie Thu, 12 Sep 2002 16:38:49 -0700

On Wednesday 11 September 2002 09:29 pm, Earl Hood wrote:
> On September 11, 2002 at 03:50, Mooffie wrote:
> > Currently, charset conversion routines are not applied to HTML messages.
. . .
> > My suggestion is to apply the charset conversion routines to HTML
> > messages as well.
. . .
> A thing to consider is to have a pre-filtering step for any
> text/* type that allows for a pre-conversion processing before a text
> entity is passed to a filter.


Yes, that's a better idea.

> This relates to the TODO item of having
> chained filters.

Chained? You mean that they'll run one after the other?

If you examine a filter in HSMA, you'll see that its structure is:

sub my_new_HTML_filter {
        # 1. do some processing...
        # 2. call MHA's HTML filter
        # 3. do some more processing...
}

And this looks more useful than a simple chain.
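
To make the difference concrete, here is a minimal sketch of such a wrapper. The names base_html_filter and my_new_html_filter are placeholders of mine, and the one-argument calling convention is a simplification for illustration, not MHA's real filter interface.

```perl
use strict;
use warnings;

# Stand-in for MHA's own HTML filter (hypothetical; the real filter takes
# more arguments and returns more than a single string).
sub base_html_filter {
    my ($data_ref) = @_;
    return "<div>$$data_ref</div>";
}

# Wrapper filter: pre-process, delegate, post-process.
sub my_new_html_filter {
    my ($data_ref) = @_;

    # 1. pre-processing: normalize line endings before the real filter runs
    (my $data = $$data_ref) =~ s/\r\n/\n/g;

    # 2. call the wrapped filter
    my $html = base_html_filter(\$data);

    # 3. post-processing: mark the output so it can be styled
    $html =~ s/^<div>/<div class="rtl">/;
    return $html;
}

print my_new_html_filter(\"shalom\r\n");
```

With a plain chain, step 3 could not see what step 1 did; with a wrapper, both steps live in the same sub and can share state.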

> > 2.
> >
> > There are two instances where we (and MHA) don't know the charset of the
> > data:
. . .
> > Since MHA doesn't know the charset of the data, a UTF-8 conversion can't
> > be carried out.
. . .
> > My suggestion is to create a new resource, <DefaultCharset>, that allows
> > one to specify a default charset.
. . .
> This is a nice shortcut to something like the following that has
> the same effect:
>
>     <CharsetConverters>
>     plain; MyMHADefault::str2html; MyMHADefault.pm
>     </CharsetConverters>
>
> In MyMHADefault.pm:
>
>     package MyMHADefault;
>     require 'readmail.pl';
>
>     my $default_charset = 'iso-8859-8';
>     sub str2html {
>       my $charcnv = readmail::load_charset($default_charset);
>       $charcnv->($_[0], $default_charset);
>     }
>     1;

That's right, but you can't expect the user to be a Perl programmer... so this 
isn't merely a "shortcut".
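
For what it's worth, the logic behind such a resource would be tiny. A rough sketch, assuming the resource value ends up in a variable (here $default_charset, a name I made up) and that the declared charset arrives as a possibly undefined string:

```perl
use strict;
use warnings;

# Hypothetical setting that a <DefaultCharset> resource might populate.
my $default_charset = 'iso-8859-8';

# Return the charset to use for an entity: the declared one if present,
# otherwise the configured default.
sub effective_charset {
    my ($declared) = @_;
    return lc($declared) if defined $declared && $declared =~ /\S/;
    return $default_charset;
}

print effective_charset(undef), "\n";            # no charset parameter at all
print effective_charset('Windows-1255'), "\n";   # declared charset wins
```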

> > 3.
> >
> > Misconfigured MUAs, including some web-mails, may declare an incorrect
> > charset.
. . .
> > My suggestion is to create a new resource, <CharsetAliases>, to have MHA
> > treat some charsets as others. Then, for example, if I have a Hebrew
> > mailing list, I'd write:
> >
> > <CharsetAliases>
> > iso-8859-8;  us-ascii iso-8859-1 iso-8859-8-i x-unknown x-user-defined
> > </CharsetAliases>
. . .
> You can do the following to get a similar effect:
>
> <CharsetConverters>
> us-ascii; MyHebrewConverter::str2html; MyHebrewConverter.pm
> iso-8859-1; MyHebrewConverter::str2html; MyHebrewConverter.pm
> iso-8859-8; MyHebrewConverter::str2html; MyHebrewConverter.pm
> iso-8859-8-i; MyHebrewConverter::str2html; MyHebrewConverter.pm
> x-unknown; MyHebrewConverter::str2html; MyHebrewConverter.pm
> x-user-defined; MyHebrewConverter::str2html; MyHebrewConverter.pm
> </CharsetConverters>

Your suggestion may fail. Imagine the following situation:

Let's say MyHebrewConverter::str2html() converts its input to UTF-8.

Now, I have iso-8859-8 data which is incorrectly declared (by the MUA) as 
"us-ascii". MyHebrewConverter::str2html() will see "us-ascii" in the 
"charset" argument and thus the conversion will fail -- because it's not 
"us-ascii".

However, if "us-ascii" is an alias for "iso-8859-8" (using <CharsetAliases>), 
MyHebrewConverter::str2html() will think that the data is "iso-8859-8", which 
it really is. 
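
A minimal sketch of the difference, assuming the aliases end up in a hash (%charset_alias is my name for it) that is consulted before the converter is called. The str2html below is only a stand-in that reports which charset it was given:

```perl
use strict;
use warnings;

# Hypothetical alias table: declared charset => charset to assume instead.
my %charset_alias = (
    'us-ascii'       => 'iso-8859-8',
    'iso-8859-1'     => 'iso-8859-8',
    'iso-8859-8-i'   => 'iso-8859-8',
    'x-unknown'      => 'iso-8859-8',
    'x-user-defined' => 'iso-8859-8',
);

# Stand-in converter that only reports which charset it was told about.
sub str2html {
    my ($data, $charset) = @_;
    return "[$charset] $data";
}

# Resolve the alias first, then hand the resolved name to the converter.
sub convert {
    my ($data, $declared) = @_;
    my $charset = lc($declared);
    $charset = $charset_alias{$charset} if exists $charset_alias{$charset};
    return str2html($data, $charset);
}

print convert('...', 'US-ASCII'), "\n";
```

The converter never sees the wrong label, so it needs no knowledge of the aliases at all.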

> I guess the main semantic difference is that if CharsetAliases
> were used, if mhonarc sees "us-ascii", when it calls
> MyHebrewConverter::str2html, it will actually pass in "iso-8859-8"
> as $charset instead of "us-ascii"

Exactly!

In HSMA I indeed used the above two ways to solve the problem:

1. I provided a resource file with the elaborate <CharsetConverters> resource 
you just gave, for use when the user wants the archive encoding to be 
windows-1255 (which is an 8-bit encoding).

2. Because the user may want the archive encoding to be UTF-8, I had to 
include an aliases table in the code that does the UTF-8 conversion.

A <CharsetAliases> resource can solve this problem. You may have got the 
impression that it's only useful for buggy MUAs, but that's not so. For 
example, Hebrew messages may be marked as either "iso-8859-8" or 
"iso-8859-8-i". Both are standardized and both stand for the same encoding; 
the former, which is deprecated nowadays, means "Visual Hebrew" and the 
latter "Logical Hebrew" (the distinction is not important for our 
discussion). What I want is to let MHA know that "iso-8859-8-i" is actually 
the standard "iso-8859-8" encoding, and a <CharsetAliases> mechanism looks 
like an elegant solution.

Note that my suggestions are intended to help the ordinary user that manages 
MHA (the "administrator"). If all users were programmers, they wouldn't need 
them (though they'd have to spend a lot of time coding).

> I can see the usefulness of it, especially when used with the existing
> converters.

Excellent.

> > 4.
> >
> > I see that UTF8.pm includes a few hard-coded aliases (e.g.
> > "windows-1250" --> "cp1250"). It might be possible to extend
> > <CharsetAliases> to have this function too; for example:
> >
> > <CharsetAliases>
> > cp1250; windows-1250
> > . . .
> > cp1255; windows-1255
> > . . .
> > apple-hebrew; x-mac-hebrew
> > </CharsetAliases>
>
> True.

Actually, it would be silly for <CharsetAliases> not to have this function, 
because the intent is to spare the user editing the source.
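
Reading such a resource is not much work either. A sketch, assuming the "canonical; alias alias ..." line format from my example above (parse_charset_aliases is a made-up name, not an existing MHA routine):

```perl
use strict;
use warnings;

# Parse lines of the form "canonical; alias1 alias2 ..." into a lookup
# table keyed by alias.
sub parse_charset_aliases {
    my ($text) = @_;
    my %alias_to_canonical;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;
        my ($canonical, $aliases) = split /\s*;\s*/, $line, 2;
        next unless defined $aliases;
        $canonical =~ s/^\s+//;
        for my $alias (split ' ', $aliases) {
            $alias_to_canonical{ lc($alias) } = lc($canonical);
        }
    }
    return \%alias_to_canonical;
}

my $table = parse_charset_aliases(<<'END');
cp1250; windows-1250
cp1255; windows-1255
apple-hebrew; x-mac-hebrew
END

print "$_ => $table->{$_}\n" for sort keys %$table;
```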

> > 5.
> >
> > Although UTF-8 has its advantages, some administrators might prefer
> > their national 8-bit encoding (because it requires less disk space,
> > because they already have 3rd party tools that work with it (e.g. search
> > tools), etc). It seems that it won't be difficult to create a new
> > conversion routine (one can start from MHonArc::UTF8::str2sgml) that
> > converts everything to a common arbitrary encoding, which can be an 8-bit
> > based one. A new resource, e.g. <TargetEncoding> or <ArchiveEncoding>,
> > could determine this target encoding (which could also be "utf-8"(!), so
> > this routine could eventually obsolete MHonArc::UTF8::str2sgml).
>
> Can't someone achieve something similar with:
>
> <DecodeHeads>
> <CharsetConverters override>
> plain;          mhonarc::htmlize;
> default;        -decode-
> </CharsetConverters>

No, the intent is to convert all messages to a common encoding the user 
specifies. You already have a function, MHonArc::UTF8::str2sgml, that 
converts all messages to UTF-8. My suggestion is to extend this function so 
that any target encoding is possible, not just UTF-8.
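
To show what I mean, here is a rough sketch. It uses Perl's Encode module purely for illustration; that is my assumption, not how MHonArc::UTF8::str2sgml works, and to_archive_encoding is a made-up name. Characters the target encoding can't represent fall back to numeric character references.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert $bytes from the $source charset to the $target charset.
# Characters the target cannot represent become numeric character
# references.
sub to_archive_encoding {
    my ($bytes, $source, $target) = @_;
    my $text = decode($source, $bytes);

    # Escape markup-significant characters before encoding.
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;

    return encode($target, $text, Encode::FB_XMLCREF);
}

# Hebrew alef (0xE0 in iso-8859-8) survives a conversion to cp1255 ...
my $hebrew = to_archive_encoding("\xE0", 'iso-8859-8', 'cp1255');

# ... but has no slot in iso-8859-1, so it becomes a character reference.
my $latin = to_archive_encoding("\xE0", 'iso-8859-8', 'iso-8859-1');
print "$latin\n";
```

With "utf-8" as the target the fallback is never triggered, which is why one routine could cover both cases.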

> Of course, it does not allow one to explicitly specify a target
> encoding to allow for "smart" conversion from one charset to the
> final one since whatever is registered for the "plain" set is not
> provided any information on what the source format really is.

Yes, but this can be solved with the <DefaultCharset> resource.

> Ah, my SGML background is showing through.  The named entities are
> standard in SGML, and unfortunately, never adopted by HTML.  I knew
> this would eventually be a problem.
>
> The named entities have the advantage of being usable across character
> sets while numeric are tied to the current character set in use.

No, numeric character references are _independent_ of the encoding of the 
page, because they specify the Unicode number of the character.

That is, "&#254;" is always the letter Thorn, no matter what the encoding 
is.
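
A small sketch to illustrate (again using the Encode module only for the demonstration; to_numeric_refs is a made-up name): the same letter arrives in two different encodings and ends up as the same reference.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace every non-ASCII character with a decimal character reference.
sub to_numeric_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# The same letter (thorn, U+00FE) arriving in two different encodings.
my $from_latin1 = decode('iso-8859-1', "\xFE");
my $from_utf8   = decode('UTF-8',      "\xC3\xBE");

print to_numeric_refs($from_latin1), "\n";   # &#254;
print to_numeric_refs($from_utf8),   "\n";   # &#254;
```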

A good explanation can be found in the HTML spec:

http://www.w3.org/TR/REC-html40/charset.html

And while you're at it, please check the following list:

http://www.w3.org/TR/REC-html40/sgml/entities.html

> I welcome numeric character entity reference mappings using the
> &#x...; notation.  This should work for most modern browsers and
> avoid having to have dependencies on other modules (at least in the
> default configuration).

Since the named entities don't work anyway (well... the bulk of them), using 
numeric character references can't be worse :-)

> A major goal, and I think one reason for its usefulness, is that it
> should be easy to install and get going.

I agree with you.

I urge you to read the HSMA source and *.mrc files in order to understand the 
rationales behind my suggestions. In the code I dealt with the above issues, 
issues that are not at all specific to Hebrew, and which, in my opinion, 
should be handled by MHA itself.


