Denis Kasak <denis.ka...@gmail.com> writes:

> > > Python "assumes" ASCII and if the decodes/encoded text doesn't
> > > fit that encoding it refuses to guess.
> >
> > Which is reasonable given that Python is programming language where it's
> > better to have more conservative assumption about encodings so errors
> > can be more quickly diagnosed. A newsreader however is a different
> > beast, where it's better to make a less conservative assumption that's
> > more likely to display messages correctly to the user. Assuming ISO
> > 8859-1 in the absense of any specified encoding allows the message to be
> > correctly displayed if the character set is either ISO 8859-1 or ASCII.
> > Doing things the "pythonic" way and assuming ASCII only allows such
> > messages to be displayed if ASCII is used.
>
> Reading this paragraph, I've began thinking that we've misunderstood
> each other. I agree that assuming ISO 8859-1 in the absence of
> specification is a better guess than most (since it's more likely to
> display the message correctly).
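To make the trade-off in the quoted text concrete, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part, and decode_guess is a hypothetical helper invented for illustration, not anything from the standard library or from a real newsreader. Falling back to the locale's preferred encoding when nothing is declared is likewise just one possible policy.

```python
import locale

def decode_guess(raw, declared=None, fallback=None):
    """Decode bytes with the declared charset if given, else a fallback guess.

    Hypothetical helper: when no fallback is passed, it uses the locale's
    preferred encoding, which is an assumption of this sketch.
    """
    if fallback is None:
        fallback = locale.getpreferredencoding(False)
    return raw.decode(declared or fallback)

# "cafe" with an acute accent on the e, encoded as ISO 8859-1.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
# "privet" (Russian for "hi") in Cyrillic, encoded as KOI8-R.
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"
koi8_bytes = russian.encode("koi8-r")

# Strict ASCII refuses to guess: any byte >= 0x80 raises an error.
try:
    decode_guess(latin1_bytes, fallback="ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# ISO 8859-1 never raises, because every byte value maps to a character...
as_latin1 = decode_guess(latin1_bytes, fallback="iso-8859-1")
# ...but for input in another charset it silently yields wrong characters.
mojibake = decode_guess(koi8_bytes, fallback="iso-8859-1")
# Only the right charset (declared or guessed) recovers the text.
correct = decode_guess(koi8_bytes, declared="koi8-r")
```

So the ASCII assumption fails loudly, while the ISO 8859-1 assumption always "succeeds" but is only correct when the bytes really are ISO 8859-1 or ASCII.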
So, yeah, back on the subject of programming in Python and supporting
character sets beyond ASCII: if you have to make an assumption, I really
think it would be better to use whatever the host OS's default is, if the
host OS has such a thing. Assuming ISO 8859-1 works only in select regions
on Unix systems, and may fail even in those regions on Windows, Mac OS, and
other systems. Even leaving the OS considerations aside, the regional
constraints alone are likely to make an ISO 8859-1 assumption produce
/incorrect/ results anywhere eastward of central Europe. Is a user in
Russia (or China, or Japan) *really* most likely to be using ISO 8859-1?

As a point of reference, here's what's in the man pages that I have
installed (note the /complete/ and conspicuous lack of references to even
some notable eastern languages or character sets, such as Chinese and
Japanese, in the /entire/ ISO 8859 spectrum):

"ISO 8859 Alphabets
 The full set of ISO 8859 alphabets includes:

 ISO 8859-1    West European languages (Latin-1)
 ISO 8859-2    Central and East European languages (Latin-2)
 ISO 8859-3    Southeast European and miscellaneous languages (Latin-3)
 ISO 8859-4    Scandinavian/Baltic languages (Latin-4)
 ISO 8859-5    Latin/Cyrillic
 ISO 8859-6    Latin/Arabic
 ISO 8859-7    Latin/Greek
 ISO 8859-8    Latin/Hebrew
 ISO 8859-9    Latin-1 modification for Turkish (Latin-5)
 ISO 8859-10   Lappish/Nordic/Eskimo languages (Latin-6)
 ISO 8859-11   Latin/Thai
 ISO 8859-13   Baltic Rim languages (Latin-7)
 ISO 8859-14   Celtic (Latin-8)
 ISO 8859-15   West European languages (Latin-9)
 ISO 8859-16   Romanian (Latin-10)"

"ISO 8859-1 supports the following languages: Afrikaans, Basque, Catalan,
 Danish, Dutch, English, Faeroese, Finnish, French, Galician, German,
 Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and
 Swedish."

"ISO 8859-2 supports the following languages: Albanian, Bosnian, Croatian,
 Czech, English, Finnish, German, Hungarian, Irish, Polish, Slovak,
 Slovenian and Sorbian."
"ISO 8859-7 encodes the characters used in modern monotonic Greek."

"ISO 8859-9, also known as the "Latin Alphabet No. 5", encodes the
 characters used in Turkish."

"ISO 8859-15 supports the following languages: Albanian, Basque, Breton,
 Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French,
 Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian,
 Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish
 Gaelic, Spanish, and Swedish."

"ISO 8859-16 supports the following languages: Albanian, Bosnian, Croatian,
 English, Finnish, German, Hungarian, Irish, Polish, Romanian, Slovenian
 and Serbian."

--
Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr)))).
--
http://mail.python.org/mailman/listinfo/python-list