On 07/08/2013 01:53 PM, ferdy.blat...@gmail.com wrote:
Hi Steven,

thank you for your reply... I really needed another Python guru who
is also an English teacher! Sorry if English is not my mother tongue...
"uncorrect" instead of "incorrect" (I misapplied the "similarity
principle" like "unpleasant...>...uncorrect").

Apart from these trifles, you said:
All characters are UTF-8 characters. "a" is a UTF-8 character. So is "ă".
Not using Python 3, for me (a programmer who was present at the beginning of
computer science, badly interacting with many languages from assembler to
Fortran and from C to Pascal and so on) it was a hard job to handle the
abrupt transition from characters only equal to bytes to some special
characters defined with 2, 3 bytes and even more.

Characters do not have a width. They are Unicode code points, an abstraction. It's only when you encode them into byte strings that a code point takes on any specific width. Some encodings use one byte per character (and raise errors for characters they cannot represent), some use two bytes each, some are variable-width, and so on.
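
To make that concrete, here is a small Python 3 sketch. The choice of "ă" and of these four encodings is just for illustration; any character and any set of codecs would show the same idea, with different numbers.

```python
# One abstract code point, several concrete byte widths.
ch = "ă"  # U+0103, LATIN SMALL LETTER A WITH BREVE

# Encode the same character with four different encodings and
# record how many bytes each one produces.
widths = {
    enc: len(ch.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le", "iso8859-2")
}
print(widths)

# A one-byte encoding that lacks the character raises an error
# instead of producing bytes.
latin1_error = None
try:
    ch.encode("latin-1")
except UnicodeEncodeError as exc:
    latin1_error = exc
    print("latin-1 cannot encode it:", exc.reason)
```

The string itself never changes; only the encoded bytes differ from one encoding to the next.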

I should have preferred another solution... but i'm not Guido....!

But Unicode has nothing to do with Guido, and it has existed for about 25 years (if I recall correctly). It's only that Python 3 is finally embracing it and making it the default type for text, as it should be. As far as I'm concerned, the only reason it wasn't done long ago is that programs were trying to fit on 640k DOS machines. Even before Unicode, there were multi-byte encodings around (e.g. Microsoft's MBCS), and each was thoroughly incompatible with all the others. And the problem with one-byte encodings is that if you need to use a Greek currency symbol in a document that's mostly Norwegian (or some such combination of characters), there might not be ANY valid way to do it within a single "character set."
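
A rough sketch of that one-byte limitation in Python 3 follows. The sample string and the helper name can_encode are made up for illustration, and I use a Greek letter rather than a currency symbol to keep it simple.

```python
# A string mixing a Norwegian letter with a Greek letter.
text = "ø and α"


def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# latin-1 has the Norwegian letter but not the Greek one;
# iso8859-7 has the Greek letter but not the Norwegian one;
# utf-8 can represent any Unicode code point.
results = {
    enc: can_encode(text, enc)
    for enc in ("latin-1", "iso8859-7", "utf-8")
}
print(results)
```

Each single-byte character set handles its own half of the string, but neither can hold both, which is exactly the situation Unicode was designed to end.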

Python 2 supports all the same Unicode features as Python 3; it's just that it defaults to byte strings, so it's HARDER to get it right.

Except for special purpose programs like a file dumper, it's usually unnecessary for a Python 3 programmer to deal with individual bytes from a byte string. Text files are a bunch of bytes, and somebody has to interpret them as characters. If you let open() handle it, and if you give it the correct encoding, it just works. Internally, all strings are Unicode, and you don't care where they came from, or what human language they may have characters from. You can combine strings from multiple places, without much worry that they might interfere.
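
Here is a minimal sketch of letting open() do the work in Python 3. The file name sample.txt and the sample text are hypothetical, and the file is written to a temporary directory so the example cleans up after itself.

```python
import os
import tempfile

# Characters from three different languages in one ordinary string.
text = "Norwegian ø, Greek α, Romanian ă"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Text mode with an explicit encoding: open() encodes on write...
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # ...and decodes on read, so the program only ever sees characters.
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode shows the raw bytes that are actually on disk.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(as_text == text)
print(len(as_text), "characters,", len(as_bytes), "bytes")
```

Only the binary-mode read ever exposes the bytes; the rest of the program deals in characters and never needs to know how wide each one was on disk.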


Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS) from the beginning (approx 1992), and has had Unicode versions of each of its APIs for nearly as long.

I appreciate you've been around a long time, and worked in a lot of languages. I've programmed professionally in at least 35 languages since 1967. But we've come a long way from the 6-bit characters I used in 1968. At that time, we packed them 10 characters to each word.

--
DaveA

--
http://mail.python.org/mailman/listinfo/python-list
