Re: unicode by default
Terry Reedy wrote: Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Easy, practical use of unicode is still a work in progress.

Apparently... the good news for me is that SBL provides their unicode font here: http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain with unicode fonts is that the glyph is tied to the code point for the represented character, and not tied to any code point that matches any keyboard scan code for typing. :-} So, I can now see the ancient text with accents and apparatus in all of my editors, but I still cannot type any ancient Greek with my keyboard... because I have to make up a keymap first. sigh

I don't find that SBL (nor Logos Software) has provided keymaps as yet... rats. I can read the text with Python though... yes!

m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Using the original meaning of font (US) or fount (commonwealth), you can't have a single font cover the whole of Unicode. A font isn't a random set of glyphs, but a set of glyphs in a common style, which can only practically be achieved for a specific alphabet. You can bundle multiple fonts covering multiple repertoires into a single TTF (etc) file, but there's not much point. In software, the term font is commonly used to refer to some ad-hoc mapping between codepoints and glyphs. This typically works by either associating each specific font with a specific repertoire (set of codepoints), or by simply trying each font in order until one is found with the correct glyph. This is a sufficiently common problem that the FontConfig library exists to simplify a large part of it. Is this a Python issue at all? No. -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 14 mai, 09:41, harrismh777 harrismh...@charter.net wrote: ... I'm getting much closer here, ...

You should really understand that Unicode is a domain per se. It is independent of any OS, programming language or application. It is up to these tools to be unicode compliant. Working in a full unicode mode (at least for text) is today practically a solved problem. But you have to ensure the whole toolchain is unicode compliant (editors, fonts (OpenType technology), rendering devices, ...).

Tip: this list is certainly not the best place to get information. I suggest you start by reading about XeTeX. XeTeX is the new TeX engine working only in unicode mode. From this starting point, you will find plenty of web sites discussing the unicode world, tools, fonts, ... A variant is to visit sites about *typography*.

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/14/2011 3:41 AM, harrismh777 wrote: Terry Reedy wrote: Easy, practical use of unicode is still a work in progress. Apparently... the good news for me is that SBL provides their unicode font here: http://www.sbl-site.org/educational/biblicalfonts.aspx I'm getting much closer here, but now the problem is typing. The pain with unicode fonts is that the glyph is tied to the code point for the represented character, and not tied to any code point that matches any keyboard scan code for typing. :-} So, I can now see the ancient text with accents and apparatus in all of my editors, but I still cannot type any ancient Greek with my keyboard... because I have to make up a keymap first. sigh I don't find that SBL (nor Logos Software) has provided keymaps as yet... rats.

You need what is called, at least with Windows, an IME -- Input Method Editor. These are part of (or associated with) the OS, so they can be used with *any* application that will accept unicode chars (in whatever encoding) rather than just ascii chars. Windows has about a hundred or so, including Greek. I do not know if that includes classical Greek with the extra marks.

I can read the text with Python though... yes!

-- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Terry Reedy tjre...@udel.edu writes: You need what is called, at least with Windows, an IME -- Input Method Editor. For a GNOME or KDE environment you want an input method framework; I recommend IBus URL:http://code.google.com/p/ibus/ which comes with the major GNU+Linux operating systems URL:http://oswatershed.org/pkg/ibus URL:http://packages.debian.org/squeeze/ibus . Then you have a wide range of input methods available. Many of them are specific to local writing systems. For writing special characters in English text, I use either ‘rfc1345’ or ‘latex’ within IBus. That allows special characters to be typed into any program which communicates with the desktop environment's input routines. Yay, unified input of special characters! Except Emacs :-( which fortunately has ‘ibus-el’ available to work with IBus URL:http://www.emacswiki.org/emacs/IBusMode :-). -- \ 己所不欲、勿施于人。| `\(What is undesirable to you, do not do to others.) | _o__) —孔夫子 Confucius, 551 BCE – 479 BCE | Ben Finney -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 12 mai, 18:17, Ian Kelly ian.g.ke...@gmail.com wrote: ... to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters A small but important correction/clarification: In Unicode, unicode does not encode a *character*. It encodes a *code point*, a number, the integer associated to the character. jmf -- http://mail.python.org/mailman/listinfo/python-list
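[A minimal Python 3 sketch of the code-point/character distinction made above: `ord()` maps a character to its code point (an integer), `chr()` maps back, and `unicodedata` gives the character's standard name.]

```python
import unicodedata

# A code point is just the integer associated with a character.
ch = "é"
cp = ord(ch)                  # character -> code point (an int)
name = unicodedata.name(ch)   # the character's standard Unicode name
roundtrip = chr(cp)           # code point -> character again
```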
Re: unicode by default
jmfauth wrote: to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters A small but important correction/clarification: In Unicode, unicode does not encode a *character*. It encodes a *code point*, a number, the integer associated to the character.

That is a huge code-point... pun intended. ... and there is another point that I continue to be somewhat puzzled about, and that is the issue of fonts. One of my hobbies at the moment is ancient Greek (biblical studies, Septuaginta LXX, and Greek New Testament). I have these texts on my computer in a folder in several formats... pdf, unicode 'plaintext', osis.xml, and XML. These texts may be found at http://sblgnt.com I am interested for the moment only in the 'plaintext' stream, because it is unicode. (First, in unicode, according to all the docs, there is no such thing as 'plaintext', so keep that in mind.)

When I open the text stream in one of my unicode editors I can see 'most' of the characters in a rudimentary Greek font with accents; however, I also see many tiny square blocks indicating (I think) that the code points do *not* have a corresponding glyph in my unicode font for that Greek symbol (whatever it is supposed to be). The point, or question, is: how does one go about making sure that there is a corresponding font glyph to match a specific unicode code point for display in a particular terminal (editor, browser, whatever)?

The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great!) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware of? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? 
kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/13/11 2:53 PM, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? There are some well-known fonts that try to cover a large section of the Unicode standard. http://en.wikipedia.org/wiki/Unicode_typeface Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Not really. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/13/2011 3:53 PM, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Easy, practical use of unicode is still a work in progress. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
John Machin wrote: On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote: If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), No such attribute. Perhaps you mean locale.getpreferredencoding()

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

Yes! :) -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Ben Finney wrote: I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program, including files, sockets, etc., contain a sequence of bytes.
* Always know whether you're dealing with text or with bytes. No object can be both.
* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is the type for text.
* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a sequence of bytes.

That is very helpful... thanks MRAB, Steve, John, Terry, Ben F, Ben K, Ian... thank you guys so much, I think I've got a better picture now of what is going on... this is also one place where I don't think the books are as clear as they need to be, at least for me... (Lutz, Summerfield).

So, UTF-16 / UTF-32 is INTERNAL only, for Python... and text in/out is based on locale... in my case UTF-8... that is enormously helpful for me... understanding locale on this system is as mystifying as unicode is in the first place. Well, after reading about unicode tonight (about four hours) I realize that it's not really that hard... there are just a lot of details that have to come together. Straightening out that whole tower-of-babel thing is sure a pain in the butt. I also was not aware that UTF-8 chars could be up to six (6) bytes long from left to right. I see now that the little-endianness I was ascribing to python is just a function of hexdump... and I was a little disappointed to find that hexdump does not support UTF-8, just ascii... doh.

Anyway, thanks again... I've got enough now to play around a bit...

PS thanks Steve for that link, informative and entertaining too... Joe says, If you are a programmer . . . and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will. :)

kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
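[Ben's type split can be sketched in a few lines of Python 3 (an illustrative sketch, not from the original post):]

```python
text = "héllo"                    # str: a sequence of code points (text)
data = text.encode("utf-8")       # bytes: a sequence of small integers

# Round-tripping always names the encoding explicitly in Python 3.
back = data.decode("utf-8")
```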
Re: unicode by default
Terry Reedy wrote: It does not matter how Python stored the unicode internally. Does this help? Your intent is signalled by how you open the file. Very much, actually, thanks. I was missing the 'internal' piece, and did not realize that if I didn't specify the encoding on the open that python would pull the default encoding from locale... kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are encodings for the EXTERNAL representation of Unicode characters in byte streams.

I also was not aware that UTF-8 chars could be up to six (6) bytes long from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that Unicode could grow to 2**32 codepoints. However ISO and the Unicode consortium have agreed that 17 planes is the utter max, and accordingly a valid UTF-8 byte sequence can be no longer than 4 bytes ... see below

>>> chr(17 * 65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>> chr(17 * 65536 - 1)
'\U0010ffff'
>>> _.encode('utf8')
b'\xf4\x8f\xbf\xbf'
>>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0: invalid start byte

-- http://mail.python.org/mailman/listinfo/python-list
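[The 4-byte ceiling follows from the 17-plane limit (maximum code point U+10FFFF). A small sketch of UTF-8 lengths by code point range:]

```python
# UTF-8 encoded length depends on the code point range:
#   U+0000..U+007F    -> 1 byte
#   U+0080..U+07FF    -> 2 bytes
#   U+0800..U+FFFF    -> 3 bytes
#   U+10000..U+10FFFF -> 4 bytes (the maximum)
samples = {0x41: 1, 0x404: 2, 0x20AC: 3, 0x10FFFF: 4}
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}
```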
Re: unicode by default
John Machin wrote: On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote: If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), No such attribute. Perhaps you mean locale.getpreferredencoding()

What about sys.getfilesystemencoding()? And if distributing a program, how can one guess which encoding the user will have?

-- goto /dev/null -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 at 1:58 AM, John Machin sjmac...@lexicon.net wrote: On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are encodings for the EXTERNAL representation of Unicode characters in byte streams. Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. However, this is entirely transparent. To the Python programmer, a unicode string is just an abstraction of a sequence of code-points. You don't need to think about UCS-2 at all. The only times you need to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters, or opening a stream in text mode; and in those cases the only encoding that matters is the external one. -- http://mail.python.org/mailman/listinfo/python-list
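[The "boundaries only" rule above, sketched concretely: decode once on the way in, work entirely on str, encode once on the way out.]

```python
# Bytes as they might arrive from a file or socket (input boundary).
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")     # -> an abstract sequence of code points
result = text.upper()          # all processing happens on str
out = result.encode("utf-8")   # -> bytes again, only at the output boundary
```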
Re: unicode by default
On 5/12/2011 12:17 PM, Ian Kelly wrote: On Thu, May 12, 2011 at 1:58 AM, John Machinsjmac...@lexicon.net wrote: On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are encodings for the EXTERNAL representation of Unicode characters in byte streams. Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. I know some people say that, but according to the definitions of the unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The standard considers 'UCS-2' obsolete long ago. See https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2 or http://www.unicode.org/faq/basic_q.html#14 The latter says: Q: What is the difference between UCS-2 and UTF-16? A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. It goes on: Sometimes in the past an implementation has been labeled UCS-2 to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters. I know that 16-bit Python *does* use surrogate pairs for supplementary chars and at least some properties work for them. I am not sure exactly what the rest means. However, this is entirely transparent. To the Python programmer, a unicode string is just an abstraction of a sequence of code-points. You don't need to think about UCS-2 at all. 
The only times you need to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters, or opening a stream in text mode; and in those cases the only encoding that matters is the external one. If one uses unicode chars in the Supplementary Planes above the BMP (the first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16), then the abstraction leaks. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
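[The surrogate-pair mechanics Terry describes can be checked directly. On a modern Python 3 build (wide or flexible string representation), a supplementary-plane character is one code point but two UTF-16 code units:]

```python
ch = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, beyond the BMP
cp = ord(ch)

# UTF-16 represents it as a surrogate pair: two 16-bit code units.
units = len(ch.encode("utf-16-le")) // 2

# The pair can be computed by hand from the code point.
hi = 0xD800 + ((cp - 0x10000) >> 10)
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
```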
Re: unicode by default
On Thu, May 12, 2011 at 2:42 PM, Terry Reedy tjre...@udel.edu wrote: On 5/12/2011 12:17 PM, Ian Kelly wrote: Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. I know some people say that, but according to the definitions of the unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The standard considers 'UCS-2' obsolete long ago. See https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2 or http://www.unicode.org/faq/basic_q.html#14 At the first link, in the section _Use in major operating systems and environments_ it states, The Python language environment officially only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to Unicode produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32) but this is commonly only done on Unix systems. PEP 100 says: The internal format for Unicode objects should use a Python specific fixed format PythonUnicode implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent. This format will hold UTF-16 encodings of the corresponding Unicode ordinals. The Python Unicode implementation will address these values as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all currently defined Unicode character points. UTF-16 without surrogates provides access to about 64k characters and covers all characters in the Basic Multilingual Plane (BMP) of Unicode. It is the Codec's responsibility to ensure that the data they pass to the Unicode object constructor respects this assumption. The constructor does not check the data for Unicode compliance or use of surrogates. 
I'm getting out of my depth here, but that implies to me that while Python stores UTF-16 and can correctly encode/decode it to UTF-8, other codecs might only work correctly with UCS-2, and the unicode class itself ignores surrogate pairs. Although I'm not sure how much this might have changed since the original implementation, especially for Python 3. -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Wed, May 11, 2011 at 3:37 PM, harrismh777 harrismh...@charter.net wrote: hi folks, I am puzzled by unicode generally, and within the context of python specifically. For one thing, what do we mean that unicode is used in python 3.x by default. (I know what default means, I mean, what changed?) The `unicode' class was renamed to `str', and a stripped-down version of the 2.X `str' class was renamed to `bytes'. I think part of my problem is that I'm spoiled (American, ascii heritage) and have been either stuck in ascii knowingly, or UTF-8 without knowing (just because the code points lined up). I am confused by the implications for using 3.x, because I am reading that there are significant things to be aware of... what? Mainly Python 3 no longer does explicit conversion between bytes and unicode, requiring the programmer to be explicit about such conversions. If you have Python 2 code that is sloppy about this, you may get some Unicode encode/decode errors when trying to run the same code in Python 3. The 2to3 tool can help somewhat with this, but it can't prevent all problems. On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the default compile option for 2.7 3.2 (I didn't change anything) is set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly? I think that UCS-2 has always been the default unicode width for CPython, although the exact representation used internally is an implementation detail. The books say that the .py sources are UTF-8 by default... and that 3.x is either UCS-2 or UCS-4. If I use the file handling capabilities of Python in 3.x (by default) what encoding will be used, and how will that affect the output? If you open a file in binary mode, the result is a non-decoded byte stream. 
If you open a file in text mode and do not specify an encoding, then the result of locale.getpreferredencoding() is used for decoding, and the result is a unicode stream. If I do not specify any code points above ascii 0xFF does any of this matter anyway? You mean 0x7F, and probably, due to the need to explicitly encode and decode. -- http://mail.python.org/mailman/listinfo/python-list
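[A sketch of the text-mode default versus an explicit encoding (the file here is a throwaway temp file, an assumption for the example):]

```python
import locale
import os
import tempfile

# With no encoding argument, text mode falls back to the locale's
# preferred encoding on this interpreter.
default = locale.getpreferredencoding(False)

# Being explicit removes any dependence on the user's locale.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00A3")            # POUND SIGN
with open(path, "rb") as f:      # binary mode: the raw, undecoded bytes
    raw = f.read()
os.remove(path)
```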
Re: unicode by default
On Wed, May 11, 2011 at 2:37 PM, harrismh777 harrismh...@charter.net wrote: hi folks, I am puzzled by unicode generally, and within the context of python specifically. For one thing, what do we mean that unicode is used in python 3.x by default. (I know what default means, I mean, what changed?) I think part of my problem is that I'm spoiled (American, ascii heritage) and have been either stuck in ascii knowingly, or UTF-8 without knowing (just because the code points lined up). I am confused by the implications for using 3.x, because I am reading that there are significant things to be aware of... what? On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the default compile option for 2.7 3.2 (I didn't change anything) is set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly? Not really sure about that, but it doesn't matter anyway. Because even though internally the string is stored as either a UCS-2 or a UCS-4 string, you never see that. You just see this string as a sequence of characters. If you want to turn it into a sequence of bytes, you have to use an encoding. The books say that the .py sources are UTF-8 by default... and that 3.x is either UCS-2 or UCS-4. If I use the file handling capabilities of Python in 3.x (by default) what encoding will be used, and how will that affect the output? If I do not specify any code points above ascii 0xFF does any of this matter anyway? ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then there is a difference for anything over that range. A byte string is a sequence of bytes. A unicode string is a sequence of these mythical abstractions called characters. So a unicode string u'\u00a0' will have a length of 1. 
Encode that to UTF-8 and you'll find it has a length of 2 (because UTF-8 uses two or more bytes to encode everything above 0x7F; the top bit is used to signal that more bytes follow for this character). If you want the history behind the whole encoding mess, Joel Spolsky wrote a rather amusing article explaining how this all came about: http://www.joelonsoftware.com/articles/Unicode.html And the biggest reason to use Unicode is so that you don't have to worry about your program messing up because someone hands you input in a different encoding than you used. -- http://mail.python.org/mailman/listinfo/python-list
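[That length difference is easy to verify:]

```python
s = "\u00a0"               # NO-BREAK SPACE: one character, one code point
b = s.encode("utf-8")      # ...but two bytes once encoded as UTF-8
```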
Re: unicode by default
Ian Kelly wrote: Ian, Benjamin, thanks much. The `unicode' class was renamed to `str', and a stripped-down version of the 2.X `str' class was renamed to `bytes'. ... thank you, this is very helpful. If I do not specify any code points above ascii 0xFF does any of this matter anyway? You mean 0x7F, and probably, due to the need to explicitly encode and decode. Yes, actually, I did... and from Benjamin's reply it seems that this matters only if I am working with bytes. Is it true that if I am working without using bytes sequences that I will not need to care about the encoding anyway, unless of course I need to specify a unicode code point? Thanks again. kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 8:51 am, harrismh777 wrote: Is it true that if I am working without using bytes sequences that I will not need to care about the encoding anyway, unless of course I need to specify a unicode code point? Quite the contrary. (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. -- http://mail.python.org/mailman/listinfo/python-list
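[Point (2) in action: the escape names the code point directly, and no bytes appear until you explicitly encode:]

```python
import unicodedata

ch = "\u0404"                   # a Cyrillic character, named by code point
cp = ord(ch)
name = unicodedata.name(ch)     # its standard Unicode name
encoded = ch.encode("utf-8")    # bytes only appear on request
```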
Re: unicode by default
John Machin wrote: (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. Thanks John. In reverse order, I understand point (2). I'm less clear on point (1). If I generate a string of characters that I presume to be ascii/utf-8 (no \u0404 type characters) and write them to a file (stdout) how does default encoding affect that file... by default? I'm not seeing that there is anything unusual going on... If I open the file with vi? If I open the file with gedit? emacs? Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere... protocol? thanks -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 12/05/2011 02:22, harrismh777 wrote: John Machin wrote: (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. Thanks John. In reverse order, I understand point (2). I'm less clear on point (1). If I generate a string of characters that I presume to be ascii/utf-8 (no \u0404 type characters) and write them to a file (stdout) how does default encoding affect that file... by default? I'm not seeing that there is anything unusual going on... If I open the file with vi? If I open the file with gedit? emacs? Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere... protocol?

You need to understand the difference between characters and bytes. A string contains characters, a file contains bytes. The encoding specifies how a character is represented as bytes. For example:

In the Latin-1 encoding, the character £ is represented by the byte 0xA3.
In the UTF-8 encoding, the character £ is represented by the byte sequence 0xC2 0xA3.
In the ASCII encoding, the character £ can't be represented at all.

The advantage of UTF-8 is that it can represent _all_ Unicode characters (codepoints, actually) as byte sequences, and all those in the ASCII range are represented by the same single bytes which the original ASCII system used. Use the UTF-8 encoding unless you have to use a different one. A file contains only bytes, a socket handles only bytes. 
Which encoding you should use for characters is down to protocol. A system such as email, which can handle different encodings, should have a way of specifying the encoding, and perhaps also a default encoding. -- http://mail.python.org/mailman/listinfo/python-list
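[The £ example above, spelled out in Python 3: one byte in Latin-1, two bytes in UTF-8, and not representable at all in ASCII.]

```python
pound = "\u00A3"                      # £, POUND SIGN

latin1 = pound.encode("latin-1")      # one byte: 0xA3
utf8 = pound.encode("utf-8")          # two bytes: 0xC2 0xA3

try:
    pound.encode("ascii")             # ASCII simply can't represent it
    representable = True
except UnicodeEncodeError:
    representable = False
```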
Re: unicode by default
On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote: Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere...protocol? You need to understand the difference between characters and bytes. http://www.joelonsoftware.com/articles/Unicode.html is also a good resource. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Steven D'Aprano wrote: You need to understand the difference between characters and bytes. http://www.joelonsoftware.com/articles/Unicode.html is also a good resource.

Thanks for being patient guys, here's what I've done:

>>> astr = "pound sign "
>>> asym = "\u00A3"
>>> afile = open("myfile", mode='w')
>>> afile.write(astr + asym)
12
>>> afile.close()

When I edit myfile with vi I see the 'characters': pound sign £ ... same with emacs, same with gedit ... When I hexdump myfile I see this:

0000000 6f70 6e75 2064 6973 6e67 c220 00a3

This is *not* what I expected... well it is (little-endian) right up to the 'c2' and that is what is confusing me. I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16 by default (python3), so I was expecting a '00A3' little-endian as 'A300', but what I got instead was UTF-8 little-endian 'c2a3'. See my problem? ... when I open the file with emacs I see the character pound sign... same with gedit... they're all using UTF-8 by default. By default it looks like Python3 is writing output with UTF-8 as default... and I thought that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused here... also, I used the character sequence \u00A3 which I thought was UTF-16... but Python3 changed my intent to 'c2a3' which is the normal UTF-8...

Thanks again for your patience... I really do hate to be dense about this... but this is another area where I'm just beginning to dabble and I'd like to know soon what I'm doing...

Thanks for the link Steve... I'm headed there now...

kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 11:22 am, harrismh777 wrote:

> John Machin wrote:
>> (1) You cannot work without using byte sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be able to extract / guess) the input encoding. You need to encode
>> your output using an encoding that is expected by the consumer (or use
>> an output method that will do it for you).
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use an escape sequence, e.g. \u0404 is a Cyrillic character.
>
> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1). If I generate a string of characters that I presume to
> be ascii/utf-8 (no \u0404 type characters) and write them to a file
> (stdout), how does the default encoding affect that file, by default?
> I'm not seeing that there is anything unusual going on...

About "characters that I presume to be ascii/utf-8 (no \u0404 type characters)": all Unicode characters (including U+0404) are encodable in bytes using UTF-8.

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends mostly on sys.stdout.encoding. This is likely to be UTF-8 on a Linux/OSX platform. On a typical American / Western European / [former] colonies Windows box, this is likely to be cp850 in a Command Prompt window, and cp1252 in IDLE.

* UTF-8: all Unicode characters are encodable in UTF-8. The only problem arises if the terminal can't render the character -- you'll get spaces or blobs or boxes with hex digits in them, or nothing.
* Windows (Command Prompt window): only a small subset of characters can be encoded in e.g. cp850; anything else causes an exception.
* Windows (IDLE): ignores sys.stdout.encoding and renders the characters itself. Same outcome as *x/UTF-8 above.

If you write directly (or sys.stdout is redirected) to a FILE, the default encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless the machine's site.py has been fiddled with to make it UTF-8 or something else.
> If I open the file with vi? If I open the file with gedit? emacs?

Any editor will have a default encoding; if that doesn't match the file encoding, you have a (hopefully obvious) problem if the editor doesn't detect the mismatch. Consult your editor's docs or HTFF1K.

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,

Yes.

> and if so what is the best way to 'guess' the encoding?

google(chardet), or rummage through the mail headers (but 4 hex digits in a box are a symptom of inability to render, not necessarily caused by an incorrect decoding).

> ... is it coded in the stream somewhere... protocol?

Should be.

-- http://mail.python.org/mailman/listinfo/python-list
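The two defaults John describes can be inspected directly. The sketch below is in Python 3 terms (where the default for text-mode `open()` comes from the locale rather than `sys.getdefaultencoding()`); the file name is arbitrary:

```python
import locale
import os
import sys
import tempfile

# Encoding Python would use for this terminal (may be None when
# stdout is redirected to a file or pipe):
print(sys.stdout.encoding)

# Encoding used by text-mode open() when you pass no encoding= argument:
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly removes all the guesswork:
path = os.path.join(tempfile.mkdtemp(), "pound.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00A3")          # POUND SIGN, one character

# Re-reading the file in binary mode shows what actually hit the disk:
with open(path, "rb") as f:
    raw = f.read()
print(raw)                     # b'\xc2\xa3' -- UTF-8 spends two bytes on U+00A3
```

Being explicit about `encoding=` in every `open()` call sidesteps the whole "which default applies here?" question.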
Re: unicode by default
MRAB pyt...@mrabarnett.plus.com writes:

> You need to understand the difference between characters and bytes.

Yep. Those who don't need to join us in the third millennium, and the resources pointed out in this thread are good to help that.

> A string contains characters, a file contains bytes.

That's not true for Python 2. I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program, including files, sockets, etc., contain a sequence of bytes.
* Always know whether you're dealing with text or with bytes. No object can be both.
* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is the type for text.
* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a sequence of bytes.

-- \ "I went to a garage sale. 'How much for the garage?' 'It's not for sale.'" --Steven Wright
_o__) Ben Finney
-- http://mail.python.org/mailman/listinfo/python-list
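The last two bullets can be checked directly in Python 3; the sample string below is arbitrary, chosen to include John Machin's \u0404 example character:

```python
text = "Ukrainian \u0404"        # str: a sequence of characters (code points)
data = text.encode("utf-8")      # bytes: a sequence of byte values

assert isinstance(text, str) and isinstance(data, bytes)

# Lengths differ because the units differ: characters vs. bytes.
assert len(text) == 11           # 10 ASCII characters plus U+0404
assert len(data) == 12           # U+0404 occupies two bytes in UTF-8

# decode() is the inverse of encode() for a matching codec:
assert data.decode("utf-8") == text
```

In Python 3 no object is both: `text == data` is simply False, so mixing them up fails fast instead of silently.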
Re: unicode by default
On 5/11/2011 11:44 PM, harrismh777 wrote:

> [... same example as above: afile.write(astr + asym) in text mode
> produced the UTF-8 bytes 'c2a3', not the expected UTF-16 ...]

If you open a file as binary (bytes), you must write bytes, and they are stored without transformation. If you open in text mode, you must write text (strings are unicode in 3.2) and Python will encode to bytes using either some default or the encoding you specified in the open statement. It does not matter how Python stored the unicode internally.

Does this help? Your intent is signalled by how you open the file.

-- Terry Jan Reedy
-- http://mail.python.org/mailman/listinfo/python-list
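Terry's text-mode/binary-mode distinction can be demonstrated without touching the filesystem, using in-memory streams (a sketch; `io.TextIOWrapper` is what text-mode `open()` layers over the raw byte stream in Python 3):

```python
import io

# Text mode: you write str, and the wrapper encodes on the way out.
buf = io.BytesIO()
text_stream = io.TextIOWrapper(buf, encoding="utf-8")
text_stream.write("\u00A3")
text_stream.flush()
assert buf.getvalue() == b"\xc2\xa3"       # the encoding chose these bytes

# Binary mode: no transformation, so you must hand over bytes yourself.
raw = io.BytesIO()
raw.write("\u00A3".encode("utf-16-le"))    # the 'A3 00' the poster expected
assert raw.getvalue() == b"\xa3\x00"
```

The same character becomes different bytes depending solely on the encoding applied at the boundary, which is exactly why "how Python stored the unicode internally" never shows up in the file.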
Re: unicode by default
On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:

> By default it looks like Python3 is writing output with UTF-8 as
> default... and I thought that by default Python3 was using either
> UTF-16 or UTF-32. So, I'm confused here... also, I used the character
> sequence \u00A3 which I thought was UTF-16... but Python3 changed my
> intent to 'c2a3' which is the normal UTF-8...

Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode code points. Those NN bits have nothing to do with the UTF-NN encodings, which can be used to encode the codepoints as byte sequences for EXTERNAL purposes. In your case, UTF-8 has been used as it is the default encoding on your platform.

-- http://mail.python.org/mailman/listinfo/python-list
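The internal/external split is easy to see: one code point, three different external byte sequences depending only on which codec you ask for (Python 3 syntax):

```python
ch = "\u00A3"   # POUND SIGN -- a single code point, regardless of storage

# The same character serialized under three UTF-NN encodings:
assert ch.encode("utf-8") == b"\xc2\xa3"              # what the poster got
assert ch.encode("utf-16-le") == b"\xa3\x00"          # what he expected
assert ch.encode("utf-32-le") == b"\xa3\x00\x00\x00"

# Each round-trips back to the identical string:
assert b"\xc2\xa3".decode("utf-8") == ch
```

`\u00A3` in source code names the code point; it carries no commitment to any of these byte layouts.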
Re: unicode by default
On Wed, May 11, 2011 at 8:44 PM, harrismh777 harrismh...@charter.net wrote:

> [... same example snipped ...] so I was expecting a '00A3'
> little-endian as 'A300', but what I got instead was UTF-8
> little-endian 'c2a3'

Quick note here: UTF-8 doesn't have an endianness. It's always read from left to right, with the high bit telling you whether you need to continue or not.

> See my problem? ... when I open the file with emacs I see the pound
> sign character... same with gedit... they're all using UTF-8 by
> default. By default it looks like Python3 is writing output with UTF-8
> as default... and I thought that by default Python3 was using either
> UTF-16 or UTF-32.

The fact that CPython uses UCS-2 or UCS-4 internally is an implementation detail and isn't actually part of the Python specification. As far as a Python program is concerned, a Unicode string is a list of character objects, not bytes. Much like any other object, a unicode character needs to be serialized before it can be written to a file. An encoding is a serialization function for characters.
If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), which tries to get your system's preferred encoding from environment variables (in other words, the same source that emacs and gedit will use to get the default encoding). -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:

> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),

No such attribute. Perhaps you mean locale.getpreferredencoding().

-- http://mail.python.org/mailman/listinfo/python-list
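A quick sanity check of the corrected name (the returned value is platform-dependent, so the sketch only verifies that the function exists and names a real codec):

```python
import codecs
import locale

# The function Benjamin meant: the locale-derived default used by
# text-mode open() when no encoding= is given.
enc = locale.getpreferredencoding(False)   # False: don't re-query the locale
print(enc)                                 # e.g. 'UTF-8' on a typical Linux box

# Whatever it returns must be a codec Python can actually look up:
codecs.lookup(enc)

# locale has no getdefaultencoding; that name lives (in Python 2/3) on sys:
assert not hasattr(locale, "getdefaultencoding")
```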
Re: Unicode again ... default codec ...
En Fri, 30 Oct 2009 13:40:14 -0300, zooko zoo...@gmail.com escribió:

> On Oct 20, 9:50 pm, Gabriel Genellina gagsl-...@yahoo.com.ar wrote:
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>
> I'm not convinced. I've read all of the posts and web pages and blog
> entries decrying this practice over the last several years, but as far
> as I can tell the actual harm that can result is limited (as long as
> you set it to utf-8) and the practical benefits are substantial. This
> is a pattern that I have no problem using:
>
>     import sys
>     reload(sys)
>     sys.setdefaultencoding("utf-8")
>
> The reason this doesn't cause too much harm is that anything that
> would have worked with the original default encoding ('ascii') will
> also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally, two keys that compare equal cannot coexist in the same dictionary:

    >>> 1 == 1.0
    True
    >>> d = {}
    >>> d[1] = '*'
    >>> d[1.0]
    '*'
    >>> d[1.0] = '$'
    >>> d
    {1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For this to work, both keys must have the same hash:

    >>> hash(1) == hash(1.0)
    True

Now, let's set the default encoding to utf-8:

    >>> import sys
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> x = u'á'
    >>> y = u'á'.encode('utf-8')
    >>> x
    u'\xe1'
    >>> y
    '\xc3\xa1'

(same as y = 'á' if the source encoding is set to utf-8, but I don't want to depend on that). Just to be sure we're dealing with the right character:

    >>> import unicodedata
    >>> unicodedata.name(x)
    'LATIN SMALL LETTER A WITH ACUTE'
    >>> unicodedata.name(y.decode('utf-8'))
    'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

    >>> x == y
    True

x is an accented a, y is the same thing encoded using the default encoding, both are equal. Fine. Now create a dictionary:

    >>> d = {}
    >>> d[x] = '*'
    >>> d[x]
    '*'
    >>> x in d
    True
    >>> y in d
    False    # ???
    >>> d[y] = 2
    >>> d
    {u'\xe1': '*', '\xc3\xa1': 2}

Since x == y, one should expect a single entry in the dictionary -- but we got two. That's because:

    >>> x == y
    True
    >>> hash(x) == hash(y)
    False

and this must *not* happen according to http://docs.python.org/reference/datamodel.html#object.__hash__ : "The only required property is that objects which compare equal have the same hash value."

Considering that dictionaries in Python are used almost everywhere, breaking this basic assumption is a really bad problem. Of course, all of this applies to Python 2.x; in Python 3.0 the problem was solved differently: strings are unicode by default, and the default encoding IS utf-8.

> As far as I've seen from the aforementioned mailing list threads and
> blog posts and so on, the worst thing that has ever happened as a
> result of this technique is that something works for you but fails for
> someone else who doesn't have this stanza.
> (http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/)
> That's bad, but probably just including this stanza at the top of the
> file that you are sharing with that other person instead of doing it
> in a sitecustomize.py file will avoid that problem.

And then you break all other libraries that the program is using, including the Python standard library, because the default encoding is a global setting. What if another library decides to use latin-1 as the default encoding, using the same trick? Latest one wins...

You said "the practical benefits are substantial" but I, for myself, cannot see any benefit. Perhaps if you post your real problems, someone can find the solution. The right way is to fix your program to do the right thing, not to hide the bugs under the rug.

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
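Gabriel's closing remark that Python 3 "solved the problem differently" can be verified: text and bytes never compare equal there, so the equal-but-differently-hashed key pair from his Python 2 session cannot arise (a sketch using the same á character):

```python
x = "\xe1"                      # 'á' as text (str)
y = "\xe1".encode("utf-8")      # b'\xc3\xa1' as bytes

# In Python 3 str and bytes are never equal, so no implicit decoding
# can make two differently-hashed objects compare equal:
assert x != y

# They therefore coexist in a dict as two honest, distinct keys:
d = {x: "text", y: "bytes"}
assert len(d) == 2
assert d[x] == "text" and d[y] == "bytes"
```

The hash invariant (equal objects must hash equal) is preserved trivially, because the cross-type equality that broke it is gone.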
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes:

> En Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax l...@metapensiero.it escribió:
>
> nosetest should do nothing special. You should configure the
> environment so Python *knows* that your console understands utf-8.
> Once Python is aware of the *real* encoding your console is using,
> sys.stdout.encoding will be utf-8 automatically and your problem is
> solved. I don't know how to do that within virtualenv, but the answer
> certainly does NOT involve sys.setdefaultencoding()
>
> On Windows, a normal console window on my system uses cp850:
>
>     D:\USERDATA\Gabriel> chcp
>     Tabla de códigos activa: 850
>
>     D:\USERDATA\Gabriel> python
>     Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit (Intel)] on win32
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import sys
>     >>> sys.getdefaultencoding()
>     'ascii'
>     >>> sys.stdout.encoding
>     'cp850'
>     >>> u = u"áñç"
>     >>> print u
>     áñç

This is the same on my virtualenv:

    $ python -c "import sys; print sys.getdefaultencoding(), sys.stdout.encoding"
    ascii UTF-8
    $ python -c "print u'\xe1\xf1\xe7'"
    áñç

But look at this:

    $ cat test.py
    # -*- coding: utf-8 -*-
    class TestAccents(object):
        u'\xe1\xf1\xe7'

        def test_simple(self):
            u'cioè'
            pass

    $ nosetests test.py
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.002s

    OK

    $ nosetests -v test.py
    ERROR
    ======================================================================
    Traceback (most recent call last):
      File "/tmp/env/bin/nosetests", line 8, in <module>
        load_entry_point('nose==0.11.1', 'console_scripts', 'nosetests')()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 113, in __init__
        argv=argv, testRunner=testRunner, testLoader=testLoader)
      File "/usr/lib/python2.6/unittest.py", line 817, in __init__
        self.runTests()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 192, in runTests
        result = self.testRunner.run(self.test)
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 63, in run
        result.printErrors()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/result.py", line 81, in printErrors
        _TextTestResult.printErrors(self)
      File "/usr/lib/python2.6/unittest.py", line 724, in printErrors
        self.printErrorList('ERROR', self.errors)
      File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
        self.stream.writeln("%s: %s" % (flavour, self.getDescription(test)))
      File "/usr/lib/python2.6/unittest.py", line 665, in writeln
        if arg: self.write(arg)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 10: ordinal not in range(128)

Who is the culprit here? The fact is, encodings are the real Y2k problem, and they are here to stay for a while!

thank you, ciao, lele.
-- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@nautilus.homeip.net | -- Fortunato Depero, 1929.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax l...@metapensiero.it escribió:

> Gabriel Genellina gagsl-...@yahoo.com.ar writes:
>> nosetest should do nothing special. You should configure the
>> environment so Python *knows* that your console understands utf-8.
>
> This is the same on my virtualenv:
>
>     $ python -c "import sys; print sys.getdefaultencoding(), sys.stdout.encoding"
>     ascii UTF-8
>     $ python -c "print u'\xe1\xf1\xe7'"
>     áñç

Good, so stdout's encoding isn't really the problem.

> But look at this:
>
>     File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
>       self.stream.writeln("%s: %s" % (flavour, self.getDescription(test)))
>     File "/usr/lib/python2.6/unittest.py", line 665, in writeln
>       if arg: self.write(arg)
>     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 10: ordinal not in range(128)
>
> Who is the culprit here?

unittest, or ultimately, this bug: http://bugs.python.org/issue4947

This is not specific to nosetest; unittest in verbose mode fails in the same way. Fix: add this method to the _WritelnDecorator class in unittest.py (near line 664):

    def write(self, arg):
        if isinstance(arg, unicode):
            arg = arg.encode(self.stream.encoding, "replace")
        self.stream.write(arg)

> The fact is, encodings are the real Y2k problem, and they are here to
> stay for a while!

Ok, but the idea is to solve the problem (or not let it happen in the first place!), not hide it under the rug :)

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
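Gabriel's patch is Python 2 specific, but the idea it relies on, encoding with errors="replace" so un-encodable characters degrade instead of raising, carries straight over to Python 3. A sketch, simulating an ASCII-only terminal with an in-memory stream:

```python
import io

# Simulate a terminal whose stream only understands ASCII.
buf = io.BytesIO()
stream = io.TextIOWrapper(buf, encoding="ascii", errors="replace")

# 'cioè' -- under the default errors='strict' this write would raise
# UnicodeEncodeError, exactly like the nosetests traceback above.
stream.write("cio\xe8")
stream.flush()

print(buf.getvalue())   # b'cio?' -- the è degrades to '?' instead of crashing
```

The trade-off is the same one Gabriel hints at: errors="replace" hides the information loss from the user, which is acceptable for diagnostic output like test names but not for data files.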
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes: En Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax l...@metapensiero.it escribió: Who is the culprit here? unittest, or ultimately, this bug: http://bugs.python.org/issue4947 Thank you. In particular I found http://bugs.python.org/issue4947#msg87637 as the best fit, I think that may be what's happening here. fix: add this method to the _WritelnDecorator class in unittest.py (near line 664): def write(self, arg): if isinstance(arg, unicode): arg = arg.encode(self.stream.encoding, replace) self.stream.write(arg) Uhm, that's almost as dirty as my reload(), you must admit! :-) bye, lele. -- nickname: Lele Gaifax| Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas| comincerò ad aver paura di chi mi copia. l...@nautilus.homeip.net | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
On Thu, Oct 22, 2009 at 13:59 +0200, Lele Gaifax wrote: Gabriel Genellina gagsl-...@yahoo.com.ar writes: unittest, or ultimately, this bug: http://bugs.python.org/issue4947 http://bugs.python.org/issue4947#msg87637 as the best fit, I think You might also want to have a look at: http://bugs.python.org/issue1293741 I hope this helps and that these bugs will be solved soon. Wolodja signature.asc Description: Digital signature -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes: DON'T do that. Really. Changing the default encoding is a horrible, horrible hack and causes a lot of problems. ... More reasons: http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/ See also this recent thread in python-dev: http://comments.gmane.org/gmane.comp.python.devel/106134 This is a problem that appears quite often, against which I have yet to see a general workaround, or even a safe pattern. I must confess that most often I just give up and change the if 0: line in sitecustomize.py to enable a reasonable default... A week ago I met another incarnation of the problem that I finally solved by reloading the sys module, a very ugly way, don't tell me, and I really would like to know a better way of doing it. The case is simple enough: a unit test started failing miserably, with a really strange traceback, and a quick pdb session revealed that the culprit was nosetest, when it prints out the name of the test, using some variant of print testfunc.__doc__: since the latter happened to be a unicode string containing some accented letters, that piece of nosetest's code raised an encoding error, that went untrapped... I tried to understand the issue, until I found that I was inside a fresh new virtualenv with python 2.6 and the sitecustomize wasn't even there. So, even if my shell environ was UTF-8 (the system being a Ubuntu Jaunty), within that virtualenv Python's stdout encoding was 'ascii'. Rightly so, nosetest failed to encode the accented letters to that. I could just rephrase the test __doc__, or remove it, but to avoid future noise I decided to go with the deprecated reload(sys) trick, done as early as possible... damn, it's just a test suite after all! Is there a correct way of dealing with this? What should nosetest eventually do to initialize it's sys.output.encoding reflecting the system's settings? 
And on the user side, how could I otherwise fix it (I mean, without resorting to the reload())? Thank you, ciao, lele. -- nickname: Lele Gaifax| Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas| comincerò ad aver paura di chi mi copia. l...@nautilus.homeip.net | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax l...@metapensiero.it escribió:

> [... full message quoted above, snipped: a unit test failed inside a
> fresh virtualenv because nosetest printed a unicode __doc__ while
> sys.stdout's encoding was 'ascii' ...]
>
> So, even if my shell environ was UTF-8 (the system being a Ubuntu
> Jaunty), within that virtualenv Python's stdout encoding was 'ascii'.
> Rightly so, nosetest failed to encode the accented letters to that.

That seems to imply that in your normal environment you altered the default encoding to utf-8 -- if so: don't do that!

> I could just rephrase the test __doc__, or remove it, but to avoid
> future noise I decided to go with the deprecated reload(sys) trick,
> done as early as possible... damn, it's just a test suite after all!
> Is there a correct way of dealing with this? What should nosetest
> eventually do to initialize its sys.stdout encoding, reflecting the
> system's settings? And on the user side, how could I otherwise fix it
> (I mean, without resorting to the reload())?

nosetest should do nothing special. You should configure the environment so Python *knows* that your console understands utf-8. Once Python is aware of the *real* encoding your console is using, sys.stdout.encoding will be utf-8 automatically and your problem is solved. I don't know how to do that within virtualenv, but the answer certainly does NOT involve sys.setdefaultencoding().

On Windows, a normal console window on my system uses cp850:

    D:\USERDATA\Gabriel> chcp
    Tabla de códigos activa: 850

    D:\USERDATA\Gabriel> python
    Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.stdout.encoding
    'cp850'
    >>> u = u"áñç"
    >>> print u
    áñç
    >>> u
    u'\xe1\xf1\xe7'
    >>> u.encode("cp850")
    '\xa0\xa4\x87'
    >>> import unicodedata
    >>> unicodedata.name(u[0])
    'LATIN SMALL LETTER A WITH ACUTE'

I opened another console, changed the code page to 1252 (the one used in Windows applications; `chcp 1252`) and invoked Python again:

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.stdout.encoding
    'cp1252'
    >>> u = u"áñç"
    >>> print u
    áñç
    >>> u
    u'\xe1\xf1\xe7'
    >>> u.encode("cp1252")
    '\xe1\xf1\xe7'
    >>> import unicodedata
    >>> unicodedata.name(u[0])
    'LATIN SMALL LETTER A WITH ACUTE'

As you can see, everything works fine without any need to change the default encoding... Just make sure Python *knows* which encoding is being used in the console on which it runs. On Ubuntu you may need to set the LANG environment variable.

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Tue, 20 Oct 2009 17:13:52 -0300, Stef Mientki stef.mien...@gmail.com escribió:

> From the thread "how to write a unicode string to a file?" and my
> specific situation:
> - reading data from Excel, Delphi and other Windows programs and unicode Python
> - using wxPython, which forces unicode
> - writing to Excel and other Windows programs
> almost all answers directed to the following solution:
> - in the python program, turn every string as soon as possible into unicode
> - in Python all processing is done in unicode
> - at the end, translate unicode into the windows specific character set (if necessary)

Yes. That's the way to go; if you follow the above guidelines when working with character data, you should not encounter big unicode problems.

> The above approach seems to work nicely, but manipulating heavily with
> string-like objects it's a crime. It's impossible to change all my
> modules from strings to unicode at once, and it's very tempting to do
> just the opposite: convert everything into strings!

Wide is the road to hell...

> # adding unicode strings and windows strings results in an error:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + my_u

(I guess you meant my_w + my_u). Formally:

    x = my_w.decode('windows-1252') + my_u    # [1]

but why are you using a byte string in the first place? Why not:

    my_w = u'my_w' + u'ö'

so you can compute my_w + my_u directly?

> # to correctly handle the above (in my situation), I need to write the
> # following code (which makes my code quite unreadable):
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = unicode(my_s, 'windows-1252') + my_u
>
> # converting to strings gives much more readable code:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + str(my_u)

But it's not the same thing, i.e., in the former case x is a unicode object, in the latter x is a byte string. Also, str(my_u) only works if it contains just ascii characters.
The counterpart of my code [1] above would be:

    x = my_w + my_u.encode('windows-1252')

That is, you use some_unicode_object.encode(desired-encoding) to do the unicode-to-bytestring conversion, and some_string_object.decode(known-encoding) to convert in the opposite sense.

> until I found this website:
> http://diveintopython.org/xml_processing/unicode.html
> By setting the default encoding, I now can go to unicode much more
> elegantly and almost fully automatically (and I guess the writing to a
> file problem is also solved):
>
> # now the manipulations of strings and unicode work OK:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + my_u
>
> The only disadvantage is that you've to put a specially named file into
> the Python directory!! So if someone knows a more elegant way to set
> the default codec, I would be much obliged.

DON'T do that. Really. Changing the default encoding is a horrible, horrible hack and causes a lot of problems. 'Dive into Python' is a great book, but suggesting to alter the default character encoding is very, very bad advice:

- site.py and sitecustomize.py contain *global* settings, affecting *all* users and *all* scripts running on that machine. Other users may get very angry at you when their own programs break or give incorrect results when run with a different encoding.
- you must have administrative rights to alter those files.
- you won't be able to distribute your code, since almost everyone else in the world won't be using *your* default encoding.
- what if another library/package/application wants to set a different default encoding?
- the default encoding for Python >= 3.0 is now 'utf-8' instead of 'ascii'.

More reasons: http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/
See also this recent thread in python-dev: http://comments.gmane.org/gmane.comp.python.devel/106134

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
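Gabriel's encode/decode rule translates directly to Python 3, where the str/bytes split makes the boundary explicit. A sketch reusing Stef's names (my_w, my_u are his illustrative variables; chr(246) as a byte value is 0xF6, which windows-1252 maps to 'ö'):

```python
my_w = b"my_w" + bytes([246])   # a windows-1252 byte string containing 0xF6
my_u = "my_u"                   # text

# bytes -> text: decode with the encoding the bytes are KNOWN to use
x = my_w.decode("windows-1252") + my_u
assert x == "my_w\xf6my_u"      # the 0xF6 byte became the character 'ö'

# text -> bytes: encode with the encoding the consumer EXPECTS
assert x.encode("windows-1252") == b"my_w\xf6my_u"

# In Python 3 the shortcut my_w + my_u simply raises TypeError,
# so the implicit-ascii-decode trap cannot occur.
```

Decode at input, work in text, encode at output: the same "unicode sandwich" the thread recommends, with the language now enforcing the bread.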