Nick Kew wrote:
> On Wednesday 12 October 2005 04:31, Paul Querna wrote:
> 
>>>An outline of what needs to be done can be found here:
>>>
>>>  http://intertwingly.net/stories/2005/09/28/xchar.rb
> 
> Erm, no.  We need to reencode from any incoming charset.
> We don't need to reinvent any wheels by recreating individual
> charset conversion tables.

There are two special cases that merit consideration.

If *after* you convert to Unicode, you end up with

1) Characters that are outside the valid range for XML, then
   they must be replaced.  The valid characters are:

      0x9, 0xA, 0xD,
      (0x20..0xD7FF),
      (0xE000..0xFFFD),
      (0x10000..0x10FFFF)

   The most common character that causes such a problem is
   a form-feed character, common in RFCs for example.

2) Characters in the range of (0x80..0x9F) are either reserved or
   are control characters.  27 of these characters were "embraced
   and extended" by our friends in Redmond.  That's the single
   table that you so viscerally reacted to.

   The most common characters that cause such problems are
   the so-called smart-quotes.
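
To make the two cases concrete, here is a rough sketch in Python.  It
is not the code from the link above, just one way to apply the same
idea to text that has already been decoded.  The helper names and the
choice of U+FFFD as the replacement are my own assumptions; the
windows-1252 mapping comes from Python's stock "cp1252" codec.

```python
import re

# Anything NOT in the XML 1.0 character ranges listed above.
INVALID_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# Code points 0x80..0x9F, which usually mean windows-1252 bytes leaked through.
C1_RANGE = re.compile(r"[\x80-\x9F]")

REPLACEMENT = "\uFFFD"  # assumption: any visible placeholder would do


def remap_c1(match):
    """Reinterpret one 0x80..0x9F code point as a windows-1252 byte."""
    try:
        return bytes([ord(match.group())]).decode("cp1252")
    except UnicodeDecodeError:
        # Five of the 32 bytes are undefined in windows-1252.
        return REPLACEMENT


def sanitize_for_xml(text):
    """Hypothetical helper: apply case 2, then case 1, to decoded text."""
    text = C1_RANGE.sub(remap_c1, text)
    return INVALID_XML.sub(REPLACEMENT, text)
```

With this, a form feed becomes the placeholder, and the code points
0x93 and 0x94 become proper left and right double quotes.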

>>Right now mod_mbox does *no* encoding translation.  We really need to be
>>calling apr_xlate all over, and turning everything into UTF-8 First.
>>Currently, each item is encoded in whatever the client program sent it
>>as... which isn't good. 
> 
> Even the HTML is erroneously sent as iso-8859-1, so posts that arrive as
> utf-8 (eg from wrowe) display incorrectly!  As of now it's not really fit for 
> purpose.  We should fix this for the benefit of all display formats, rather
> than address html, atom, or indeed anything else in isolation.

One possibility is to convert characters above 0xFF to numeric character
references, like &#8217;.  Even though it is wrong to do so, people
often consume feeds with regular expressions, "aggregate" bits from
various places using the equivalent of strcat, and toss the results into
a web page, leaving the default as iso-8859-1.  Numeric character
references have the benefit of meaning the same thing independent of
whether the bytes are interpreted as iso-8859-1, utf-8, or even us-ascii.
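
A minimal sketch of that conversion in Python, assuming the text has
already been decoded to Unicode.  The function name and the threshold
parameter are mine, purely for illustration; lowering the threshold to
0x7F would make the output pure us-ascii.

```python
def to_numeric_refs(text, threshold=0xFF):
    """Replace every character above `threshold` with a decimal
    numeric character reference such as &#8217;."""
    return "".join(
        f"&#{ord(ch)};" if ord(ch) > threshold else ch
        for ch in text
    )
```

Note that this only rewrites characters; any existing markup in the
text still needs its usual escaping.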

> Regarding the mail archives, the ideal solution would be to transcode
> incoming messages to a homogenous utf-8 before storing them.  To make
> that useful, we'd need to transcode the existing archives too, though that
> would just be a one-off script.  I see a mod_smtpd filter thrashing around
> that to-do list ...  dammit, it's the long-awaited updates to charset_lite!

Just mentioning in passing: if you have a message of uncertain encoding,
there is a regular expression that can be used to determine if it is
likely in utf-8 already.  Given the design of utf-8, false positives are
rare, and their likelihood drops as the length of the message increases.
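
For illustration, one commonly circulated form of such an expression,
written here for Python's re module over raw bytes.  Treat it as a
sketch: the pattern and the function name are my assumptions about
what such a check could look like, not a quote from any particular
source.  It accepts only well-formed utf-8 sequences, rejecting
overlong forms and surrogates.

```python
import re

# Each alternative is one well-formed utf-8 sequence, keyed by lead byte.
UTF8_ONLY = re.compile(
    rb"""
    (?: [\x00-\x7F]                          # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, no overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, straight
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, no surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # plane 16
    )*
    """,
    re.VERBOSE,
)


def looks_like_utf8(data):
    """Return True if the whole byte string is well-formed utf-8."""
    return UTF8_ONLY.fullmatch(data) is not None
```

Pure us-ascii input matches trivially, so a positive answer only
carries information when the message contains bytes above 0x7F.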

> The harder bit to deal with is _local_ encoding in a different charsets in
> header lines.  That's a PITA, and is AFAIK peculiar to SMTP.

- Sam Ruby
