mouss wrote:
>> However, it is true that the vast majority of the corpus currently comes
>> from folks who speak English (King's or Yankee) as a primary language,
>> and that's a bit of a problem as it creates considerable bias in the
>> rules.
>>
>> And even us US folks do have encoding issues. After all, English is not
>> our official language here in the US,
>
> what do you mean here? what would be your official language?

The United States of America does not have any official language.
Americanized English is our common language, but it's not official. This
means that our government has to supply forms and materials in many
languages for its citizens, because it cannot require that citizens speak
English.

For example, we have tax forms in French:
http://www.irs.gov/pub/irs-access/f2290fr_accessible.pdf

Admittedly, non-English forms and services are somewhat secondary here,
but they are present.

>> and I've got plenty of users that speak
>> multiple languages, not all of which use plain-ascii.
>
> I guess so. now I'm not sure our situation isn't worst because people
> tried to find non standard solutions that are still used. I still
> remember the days when some customers were asking us to "fix" our
> software because "it broke their accents"... hopefully these times are
> gone, but I still see "broken" mail (much more than I should). actually,
> I also see mail that doesn't get rendered correctly on thunderbird. so
> I'll admit that the issue isn't really about accented chars...

Well, yours is certainly worse, or at least more prevalent, than the
problem here in the US, but I would not say it's the worst. Generally
speaking, the worst case seems to be in smaller Asian nations, which make
really extensive use of non-US characters. At least the French can
restrict their text to the same character set as English and still be
readable, although awkward due to the screwed-up accents.

Also, smaller Asian nations to this day have a high prevalence of locally
grown mail clients, many of which are not even remotely RFC compliant but
work well with others in the same locale. They're also much more likely
to use mixed-language text containing many character sets. Speaking two
or three different languages is fairly common in the smaller countries of
the Asian region, simply out of necessity for trade with neighboring
countries.

Another area with this same basic issue would be the Middle East, but the
number of completely different character sets is smaller.