On 05/04/2010 07:51 AM, Steffen Kaiser wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On Mon, 3 May 2010, Philip Prindeville wrote:
>
>> The problem is this: the message will be intelligible to English
>> language readers, but it will generate a lot of false positives for
>> mailing list recipients who aren't expecting to get non-English
>> messages (or English messages encoded in anything other than US-ASCII,
>> ISO-8859-1, or UTF-8).
>
> :-)
>
>> If the message body is Content-Type: text/plain; charset=xxxx should
>> it be squashed down in the case of mailing list traffic for English
>> language mailing lists?
>
> Nice idea. To make it really work, you should exempt the signature.
> Meaning, there are people using their native spelling as
> name.

Found the reference, if anyone cares... RFC 2046, last paragraph of
section 4.1.2:

    In general, composition software should always use the "lowest
    common denominator" character set possible.  For example, if a body
    contains only US-ASCII characters, it SHOULD be marked as being in
    the US-ASCII character set, not ISO-8859-1, which, like all the
    ISO-8859 family of character sets, is a superset of US-ASCII.  More
    generally, if a widely-used character set is a subset of another
    character set, and a body contains only characters in the
    widely-used subset, it should be labelled as being in that subset.
    This will increase the chances that the recipient will be able to
    view the resulting entity correctly.

>
>> use Encode::First qw(encode_first);
>>
>> my $encodings = join('ascii', 'latin1', 'utf-8', $oldcharset);
> my $encodings = join(',', 'ascii', 'latin1', 'utf-8');
>
> "utf8" matches always, IMHO, but first you have to decode() the content,
> which BTW I found problematic on its own, that's why I'm using
> a "decode_first"-like function:
>
> try decode with supplied charset, then check if it is good utf8,
> then decode as latin1, which matches always.
>
> That's the same with your sequence: $oldcharset will never be reached
> because you can always encode to 'utf-8'.

Ok, right. So try the first two... And if they don't work, then
transcode to utf-8.

>
>> my ($newcharset, $newlen) = encode_first($encodings, $string);
>>
>> if ($newlen <= length($string)) {
>>     # use $newstr instead
>> }
>
> This check does not fit, IMHO: If you have a real, 7bit clean ASCII
> message, it should be the same in any other multi-byte or 8bit
> encodings, because they use ASCII as a base, don't they?

Ok, so:

use Encode::First qw(encode_first);

# also need to handle aliases...
if ($oldcharset eq 'ascii' || $oldcharset eq 'latin1' || $oldcharset eq 'utf-8') {
    ;   # already an acceptable charset, leave the body alone
} else {
    my $encodings = join(',', 'ascii', 'latin1', $oldcharset);
    my ($newcharset, $newlen) = encode_first($encodings, $string);
    if ($newcharset eq $oldcharset) {
        # neither ascii nor latin1 could represent the text
        $newcharset = 'utf-8';
    }
    # transcode as $newcharset
}

>
> Your goal is to hide the Asian charset for English messages,
> therefore I would use:
>
> my %goodCharset = map { $_ => 1 } qw/ascii latin1 iso-8859-1/;
> if (!$goodCharset{lc $oldcharset} && $goodCharset{lc $newcharset}) {
>     # replace body
> }
> UTF-8 does not do any good, but hides the Asian font :-)
>
> Regards,
>
> - --
> Steffen Kaiser

Well, not just Asian... KOI8, Cyrillic, etc. as well. And all of those
windows-xxxx abominations.

-Philip

_______________________________________________
NOTE: If there is a disclaimer or other legal boilerplate in the above
message, it is NULL AND VOID.  You may ignore it.

Visit http://www.mimedefang.org and http://www.roaringpenguin.com
MIMEDefang mailing list MIMEDefang@lists.roaringpenguin.com
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang
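
For reference, the fallback order the thread converges on (leave ascii, latin1 and utf-8 bodies alone; otherwise try ascii, then latin1, and only then transcode to utf-8) can be sketched with the core Encode module alone. This is a minimal sketch rather than anyone's actual filter code: the names narrowest_charset and squash_charset are hypothetical, the "good" charset list is an assumption, and charset aliases are still not handled.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Charsets assumed acceptable as-is on an English-language list
# (assumption for this sketch; aliases such as "us-ascii" are not handled).
my %good = map { $_ => 1 } qw(ascii latin1 utf-8);

# Return ($charset, $bytes) for the narrowest of ascii and latin1 that can
# represent $text without loss, falling back to utf-8.
# $text must already be decoded to Perl characters.
sub narrowest_charset {
    my ($text) = @_;
    for my $charset (qw(ascii latin1)) {
        # FB_CROAK makes encode() die on unmappable characters;
        # LEAVE_SRC keeps $text intact for the next attempt.
        my $bytes = eval { encode($charset, $text, FB_CROAK | LEAVE_SRC) };
        return ($charset, $bytes) if defined $bytes;
    }
    return ('utf-8', encode('UTF-8', $text));
}

# Hypothetical helper: given the declared charset and the raw body bytes,
# return the (possibly new) charset and the (possibly re-encoded) bytes.
sub squash_charset {
    my ($oldcharset, $raw) = @_;
    return ($oldcharset, $raw) if $good{ lc $oldcharset };

    # Decode first; if the declared charset is unknown or the bytes are
    # invalid for it, leave the body untouched.
    my $text = eval { decode($oldcharset, $raw, FB_CROAK | LEAVE_SRC) };
    return ($oldcharset, $raw) unless defined $text;

    return narrowest_charset($text);
}
```

A caller would replace the body and rewrite the charset parameter of the Content-Type header only when the returned charset differs from the declared one. Exempting the signature, as suggested earlier in the thread, is not attempted here.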