On 07/08/2013 01:53 PM, ferdy.blat...@gmail.com wrote:
Hi Steven,
thank you for your reply... I really needed another Python guru who is
also an English teacher! Sorry, English is not my mother tongue... I wrote
"uncorrect" instead of "incorrect" (I misapplied a "similarity
principle", as in "unpleasant" -> "uncorrect").
Apart from these trifles, you said:
All characters are UTF-8 characters. "a" is a UTF-8 character. So is "ă".
Not using Python 3, for me (a programmer who was present at the beginning of
computer science, struggling with many languages from assembler to
Fortran and from C to Pascal and so on) it was a hard job to handle the
abrupt transition from characters simply equal to bytes to special
characters defined with 2, 3 bytes and even more.
Characters do not have a width. They are Unicode code points, an
abstraction. It's only when you encode them in byte strings that a code
point takes on any specific width. And some encodings map each code point
to one byte (and raise errors for characters they can't represent), some
to two bytes each, some to a variable number of bytes, and so on.
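Here's a rough Python 3 sketch of what I mean (the character and the
encodings are just ones I picked for illustration):

s = "ă"                        # one character: U+0103, LATIN SMALL LETTER A WITH BREVE

print(len(s))                  # 1 -- one code point, no width implied
print(s.encode("utf-8"))       # b'\xc4\x83' -- two bytes in UTF-8
print(s.encode("utf-16-le"))   # b'\x03\x01' -- two bytes in UTF-16
print(s.encode("iso-8859-2"))  # b'\xe3'     -- one byte in Latin-2
try:
    s.encode("ascii")          # no ASCII byte exists for this character
except UnicodeEncodeError as e:
    print("ascii can't encode it:", e)

The string itself never changes; only the encoded byte form does.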
I would have preferred another solution... but I'm not Guido...!
But Unicode has nothing to do with Guido, and it has existed for about
25 years (if I recall correctly). It's only that Python 3 is finally
embracing it, and making it the default type for characters, as it
should be. As far as I'm concerned, the only reason it shouldn't have
been done long ago was that programs were trying to fit on 640k DOS
machines. Even before Unicode, there were multi-byte encodings around
(e.g., Microsoft's MBCS), and each was thoroughly incompatible with all
the others. And the problem with one-byte encodings is that if you need
to use a Greek currency symbol in a document that's mostly Norwegian (or
some such combination of characters), there might not be ANY valid way
to do it within a single "character set."
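A hedged sketch of that situation, in Python 3 (the text is made up,
mixing Norwegian letters with Greek ones):

text = "øre og δραχμή"   # made-up snippet: Norwegian plus Greek

for codec in ("iso-8859-1", "iso-8859-7", "utf-8"):
    try:
        print(codec, "works:", text.encode(codec))
    except UnicodeEncodeError as e:
        print(codec, "fails:", e)

Latin-1 has the 'ø' but no Greek letters, ISO-8859-7 has the Greek letters
but no 'ø'; only a Unicode encoding like UTF-8 covers the whole string.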
Python 2 supports all the same Unicode features as 3; it's just that it
defaults to byte strings. So it's HARDER to get it right.
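A small illustration, in Python 3 syntax, of the two string types I'm
talking about (in Python 2 the default str behaved like the bytes object
below, and you needed the u'' prefix to get the Unicode type):

u = "naïve"                    # str: 5 Unicode code points
b = u.encode("utf-8")          # bytes: b'na\xc3\xafve', 6 bytes

print(len(u), len(b))          # 5 6
print(b.decode("utf-8") == u)  # True -- decoding gets the same text back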
Except for special-purpose programs like a file dumper, it's usually
unnecessary for a Python 3 programmer to deal with individual bytes from
a byte string. Text files are a bunch of bytes, and somebody has to
interpret them as characters. If you let open() handle it, and if you
give it the correct encoding, it just works. Internally, all strings
are Unicode, and you don't care where they came from, or what human
language they may have characters from. You can combine strings from
multiple places, without much worry that they might interfere.
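For instance, something like this (the file name and contents are
invented) is usually all it takes:

with open("notes.txt", "w", encoding="utf-8") as f:
    f.write("smørbrød og ελιές\n")   # mixed languages, no special handling

with open("notes.txt", encoding="utf-8") as f:
    for line in f:
        text = line.rstrip()
        print(len(text), text)       # length in characters, not bytes

Give open() the right encoding and the bytes never show up in your code
at all.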
Windows NT/2000/XP/Vista/7 has used Unicode for its file system (NTFS)
from the beginning (approximately 1992), and has had Unicode versions of
each of its APIs for nearly as long.
I appreciate you've been around a long time, and worked in a lot of
languages. I've programmed professionally in at least 35 languages
since 1967. But we've come a long way from the 6-bit characters I used
in 1968. At that time, we packed ten characters into each word.
--
DaveA