On 10Aug2017 20:40, boB Stepp <robertvst...@gmail.com> wrote:
(By the way, it is nearly 14 years later, and PHP still believes that
the world is ASCII.)
I thought you must surely be engaging in hyperbole, but at
http://php.net/manual/en/xml.encoding.php I found:
"The default source encoding used by PHP is ISO-8859-1."
This kind of amounts to Python 2's situation in some ways: a PHP string or
Python 2 str is effectively just an array of bytes, treated like a lexical
stringy thing.
If you're working only in ASCII or _universally_ in some fixed 8-bit character
set (eg ISO8859-1 in Western Europe) you mostly get by if you don't look
closely. PHP's "default source encoding" means that the various _character_
based routines in PHP (things that know about characters as letters, punctuation
etc) treat these strings as using ISO8859-1 encoding. You can load UTF-8 into
these strings and work that way too (there's a PHP global setting for the
encoding).
Python 2 has a "unicode" type for proper Unicode strings.
In Python 3 str is Unicode text, and you use bytes for bytes. It is hugely
better, because you don't need to concern yourself about what text encoding a
str is - it doesn't have one - it is Unicode. You only need to care when
reading and writing data.
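To make that concrete, here is a small sketch of the str/bytes split. The sample string and the variable names are just made up for illustration:

```python
# Text lives in str; an encoding only shows up when you cross into bytes.
text = "caf\u00e9"                   # str: 4 Unicode code points ("café")
data = text.encode("utf-8")          # bytes: what you would write to a file or socket

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5, the "é" needs two bytes in UTF-8

round_trip = data.decode("utf-8")    # decoding the bytes gives the same text back
latin1 = text.encode("iso-8859-1")   # same text, different bytes in another encoding
print(round_trip == text, latin1)    # True b'caf\xe9'
```

The str itself never changes; only the bytes you get out depend on the encoding you ask for.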
So long as your editor knows to save the file in UTF-8, it will Just
Work.
So Python 3's default behavior for strings is to store them as UTF-8
encodings in both RAM and files?
Not quite.
In memory Python 3 strings are sequences of Unicode code points. The CPython
internals pick an 8 or 16 or 32 bit storage mode for these based on the highest
code point value in the string as a space optimisation decision, but that is
concealed at the language level. UTF-8 as a storage format is nearly as
compact, but has the disadvantage that you can't directly index the string
(i.e. go to character "n") because UTF-8 uses variable length encodings for the
various code points.
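A quick sketch of why that matters, using a few sample characters I picked arbitrarily. Indexing a str gives you a code point; indexing the UTF-8 bytes gives you a single byte, which may be only part of a character:

```python
# UTF-8 uses 1 to 4 bytes per code point, depending on the code point's value.
samples = ["a", "\u00e9", "\u20ac", "\U0001f600"]   # a, é, €, and an emoji
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)                 # [1, 2, 3, 4]

s = "a\u20acb"                # "a€b": three code points
raw = s.encode("utf-8")       # five bytes: 1 + 3 + 1

print(s[1] == "\u20ac")       # True: str indexing lands on the whole character
print(hex(raw[1]))            # 0xe2: only the first byte of the euro sign
print(len(s), len(raw))       # 3 5
```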
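In files however, the default encoding for text files comes from your locale
settings (see locale.getpreferredencoding()), which is 'utf-8' on most modern
UNIX systems. In that case Python will read the file's bytes as UTF-8 data and
will write Python string characters in UTF-8 encoding when writing.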
If you open a file in "binary" mode there's no encoding: you get bytes. But if
you open in text mode (no "b" in the open mode string) you get text, and you
can define the character encoding used as an optional parameter to the open()
function call.
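For example, here is a minimal sketch of the two modes. The scratch file name is hypothetical, and I pass the encoding explicitly so the result does not depend on the platform:

```python
import os
import tempfile

# A throwaway file in a temporary directory (the name is just for this demo).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: Python encodes the str using the encoding you name.
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve")            # "naïve"

# Text mode again: the bytes are decoded back into a str.
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: no encoding is involved, you get the raw bytes.
with open(path, "rb") as f:
    as_bytes = f.read()

print(repr(as_text))                 # 'naïve'
print(as_bytes)                      # b'na\xc3\xafve'
```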
No funny business anywhere? Except
perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
Python's fault. Which makes me wonder, what is my editor's default
encoding/decoding? I will have to investigate!
On most UNIX platforms most situations expect and use UTF-8. There are some
complications because this needn't be the case, but most modern environments
provide UTF-8 by default.
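If you want to see what your own environment provides, something like this sketch will report it. The variable names are mine, and the second and third values will differ from machine to machine:

```python
import locale
import sys

# The codec Python 3 uses for str.encode()/bytes.decode() when none is given.
str_codec = sys.getdefaultencoding()

# The locale's preferred encoding, which text-mode open() falls back to
# when you do not pass encoding=...
locale_enc = locale.getpreferredencoding(False)

# The encoding of the terminal or pipe that print() writes to.
stdout_enc = getattr(sys.stdout, "encoding", None)

print("str codec:       ", str_codec)
print("locale encoding: ", locale_enc)
print("stdout encoding: ", stdout_enc)
```

If the last two do not say UTF-8, that is the place to look when text comes out garbled.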
The situation in Windows is more complex for historic reasons. I believe Eryk
Sun is the go-to guy for precise technical descriptions of the Windows
situation. I'm not a Windows guy, but I gather modern Windows generally gives
you a pretty clean UTF-8 environment in most situations.
Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)