Re: python 2.7 and unicode (one more time)

2014-12-02 Thread Simon Evans


Hi Peter Otten
re:

There is no assignment 

soup_atag = whatever 

but there is one to atag. The whole session should work when you omit the
offending line

> atag = soup_atag.a 

or insert 

soup_atag = soup 

before it. 

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> html_atag = """Test html a tag example
... http://www.packtpub.com'>Home
... >> soup = BeautifulSoup(html_atag,'lxml')
>>> atag = soup.aprint(atag)
>>> atag = soup.a
>>> print(atag)
http://www.packtpub.com'>Home


>>> type(atag)

>>> tagname = atag.name
>>> print tagname
a
>>> atag.name = 'p'
>>> print (soup)
Test html a tag example
http://www.packtpub.com'>Home



>>> atag.name = 'p'
>>> print(soup)
Test html a tag example
http://www.packtpub.com'>Home



>>> atag.name = 'a'
>>> print(soup)
Test html a tag example
http://www.packtpub.com'>Home



>>> soup_atag = soup
>>> atag = soup_atag.a
>>> print (atag['href'])
http://www.packtpub.com'>Home
>>

Thank you.
Yours
Simon.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-25 Thread Chris Angelico

On Tue, Nov 25, 2014 at 10:56 PM, Steven D'Aprano wrote:
> I think this conversation is going nowhere, so it's probably best to end it.

\0

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-25 Thread Steven D'Aprano

Marko Rauhamaa wrote:

> Steven D'Aprano :
> 
>> Marko Rauhamaa wrote:
>>
>>>> Py3's byte strings are still strings, though.
>>> 
>>> Hm. I don't think so. In a plain English sense, maybe, but that kind of
>>> usage can lead to confusion.
>>
>> Only if you are determined to confuse yourself.
>>
>> [...]
>>
>> In Python usage, "string" always refers to the `str` type, unless
>> prefixed with "byte", in which case it refers to the immutable
>> byte-string type (`str` in Python 2, `bytes` in Python 3.)
> 
> You are saying what I'm saying.
> 
> Byte strings are *not* strings.

Of course they are. They are strings of bytes, just as the name suggests.

> Prairie dogs are not dogs. No need to call dogs "domesticated dogs" to
> tell them apart from "prairie dogs".

But wild dogs *are* dogs, and there is a need to distinguish between wild
dogs and domesticated dogs. 

Just as there is a need to distinguish between byte strings, ASCII strings,
Latin-1 strings, Big5 strings, Unicode strings, Tron strings and cheese
strings.

I think this conversation is going nowhere, so it's probably best to end it.

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Marko Rauhamaa

Steven D'Aprano :

> Marko Rauhamaa wrote:
>
>>> Py3's byte strings are still strings, though.
>> 
>> Hm. I don't think so. In a plain English sense, maybe, but that kind of
>> usage can lead to confusion.
>
> Only if you are determined to confuse yourself.
>
> [...]
>
> In Python usage, "string" always refers to the `str` type, unless
> prefixed with "byte", in which case it refers to the immutable
> byte-string type (`str` in Python 2, `bytes` in Python 3.)

You are saying what I'm saying.

Byte strings are *not* strings.

Prairie dogs are not dogs. No need to call dogs "domesticated dogs" to
tell them apart from "prairie dogs".


Marko



Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Chris Angelico

On Tue, Nov 25, 2014 at 9:56 AM, Steven D'Aprano wrote:
> In all cases apart from an explicit "byte string", the word "string" is
> always used for the native array-of-characters type delimited by plain
> quotation marks, as used for error messages, user prompts, etc., regardless
> whether the implementation is an array of 8-bit bytes (as used by Python
> 2), or the full Unicode character set (as used by Python 3). So in
> practice, provided you know which version of Python is being discussed,
> there is never any genuine ambiguity when using the word "string" and no
> excuse for confusion.

And frequently, even if you're talking about Py2/Py3 cross code,
there's still no ambiguity about the word "string": it means a
default-for-the-language string. The locale.setlocale() function
expects a string as its second parameter, for instance. (And
unfortunately, flatly refuses the other sort, whichever way around
that is.)
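
As a hedged illustration of the setlocale() remark: the sketch below
assumes Python 3, where the default string type is str, and uses the
portable "C" locale. LC_NUMERIC is picked only to keep side effects
small, and the helper name try_setlocale is made up for this example.
It passes both sorts of string and reports what happens rather than
promising a particular error.

```python
import locale


def try_setlocale(value):
    """Return the new locale name, or the name of the exception raised."""
    try:
        return locale.setlocale(locale.LC_NUMERIC, value)
    except (TypeError, ValueError, locale.Error) as exc:
        return type(exc).__name__


native_result = try_setlocale("C")    # the default-for-the-language string
other_result = try_setlocale(b"C")    # the other sort, under Python 3
print(native_result, other_result)
```

Under Python 2 the roles would be swapped, since there the native type
is the byte string and unicode is the other sort.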

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Steven D'Aprano

Marko Rauhamaa wrote:

>> Py3's byte strings are still strings, though.
> 
> Hm. I don't think so. In a plain English sense, maybe, but that kind of
> usage can lead to confusion.

Only if you are determined to confuse yourself.

People are quite capable of interpreting correctly sentences like:

"My friend Susan and I were talking about Jenny, and she said that she had
had a horrible fight with her boyfriend and was breaking up with him."

and despite the ambiguity correctly interpret who "she" and "her" refer to
each time. Compared to that, correctly understanding the mild complexity
of "string" is trivial.

In Python usage, "string" always refers to the `str` type, unless prefixed
with "byte", in which case it refers to the immutable byte-string type
(`str` in Python 2, `bytes` in Python 3.)

"Unicode string" always refers to the immutable Unicode string type
(`unicode` in Python 2, `str` in Python 3).

"Text string" is more ambiguous. Some people consider the prefix to be
redundant, e.g. "text string" always refers to `str`, while others consider
it to be in opposition to "byte string", i.e. to be a synonym for "Unicode
string".

In all cases apart from an explicit "byte string", the word "string" is
always used for the native array-of-characters type delimited by plain
quotation marks, as used for error messages, user prompts, etc., regardless
whether the implementation is an array of 8-bit bytes (as used by Python
2), or the full Unicode character set (as used by Python 3). So in
practice, provided you know which version of Python is being discussed,
there is never any genuine ambiguity when using the word "string" and no
excuse for confusion.
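
The terminology above can be checked from the interpreter. This is a
small sketch that should run unchanged on Python 2.7 and on Python 3.3
or later (the u prefix is not accepted by 3.0 to 3.2); the variable
names are only illustrative.

```python
import sys

native = "abc"            # plain quotation marks: the native "string"
byte_string = b"abc"      # explicit byte string
unicode_string = u"abc"   # explicit Unicode string

names = {
    "string": type(native).__name__,
    "byte string": type(byte_string).__name__,
    "Unicode string": type(unicode_string).__name__,
}
# True only where the native string is implemented as a byte string.
native_is_bytes = type(native) is type(byte_string)
print(sys.version_info[0], names, native_is_bytes)
```

On Python 2 the first two names come out the same (str); on Python 3
the first and last do.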

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Marko Rauhamaa

Chris Angelico :

> Py3's byte strings are still strings, though.

Hm. I don't think so. In a plain English sense, maybe, but that kind of
usage can lead to confusion.

For example,

   A subscription selects an item of a sequence (string, tuple or list)
   or mapping (dictionary) object:

   subscription ::=  primary "[" expression_list "]"

   [...]

   A string’s items are characters. A character is not a separate data
   type but a string of exactly one character.

   <https://docs.python.org/3/reference/expressions.html#subscriptions>


The text is probably a bit buggy since it skates over bytes and byte
arrays listed as sequences (by
<https://docs.python.org/3/reference/datamodel.html>). However, your
Python3 implementation would fail if it interpreted bytes objects to be
strings in the above paragraph:

   >>> "abc"[1]
   'b'
   >>> b'abc'[1]
   98

The subscription of a *string* evaluates to a *string*. The subscription
of a *bytes* object evaluates to a *number*.
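
A slightly longer sketch of the same distinction, assuming Python 3:
indexing and iterating a bytes object yield integers, while slicing
keeps the bytes type. The variable names are only for illustration.

```python
text = "abc"
data = b"abc"

text_item = text[1]       # a str of length one
data_item = data[1]       # an int in range(256)
data_slice = data[1:2]    # slicing, unlike indexing, keeps the bytes type

print(repr(text_item), repr(data_item), repr(data_slice))
print(list(data))         # iterating bytes also yields ints
```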


Marko

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 5:57 PM, Marko Rauhamaa  wrote:
> Yes, people call strings "Unicode strings" because Python2 *did have*
> unicode strings separate from regular strings:
>
> Python2           Python3
> --------------------------------------
> string            bytes (byte string)
> unicode string    string
>
>
> In Python2 days, Unicode was a fancy, exotic datatype for the
> connoisseurs. The rest used strings. Python3 supposedly elevates Unicode
> to boring normalcy. Now it's bytes that have fallen into (unmerited)
> disfavor.

Py3's byte strings are still strings, though. People don't use
bytearray for everything.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Marko Rauhamaa

Gregory Ewing :
> Marko Rauhamaa wrote:
>> Unicode strings is not wrong but the technical emphasis on Unicode is as
>> strange as a "tire car" or "rectangular door" when "car" and "door" are
>> what you usually mean.
>
> The reason Unicode gets emphasised so much is that until relatively
> recently, it *wasn't* what "string" usually meant in Python.
>
> When Python 3 has been around for as long as Python 2 was, things may
> change.

Yes, people call strings "Unicode strings" because Python2 *did have*
unicode strings separate from regular strings:

Python2           Python3
--------------------------------------
string            bytes (byte string)
unicode string    string

In Python2 days, Unicode was a fancy, exotic datatype for the
connoisseurs. The rest used strings. Python3 supposedly elevates Unicode
to boring normalcy. Now it's bytes that have fallen into (unmerited)
disfavor.

But old habits die hard; you call cars "automobile cars" instead of
"cars" since, after all, "cars" were always pulled by horses...

Marko

PS Maybe interestingly, Guile went through an analogous transition. As
of Guile 2.0,

  a character is anything in the Unicode Character Database.
  [...]
  Strings are fixed-length sequences of characters.
  [...]
  A bytevector is a raw bit string.

  <https://www.gnu.org/software/guile/manual/html_node/index.html>

However, Guile 1.8 still had:

  The Guile implementation of character sets currently deals only with
  8-bit characters.

  <https://www.gnu.org/software/guile/docs/docs-1.8/guile-ref/index.html>

and there were no bytevectors.

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread random832

On Sun, Nov 23, 2014, at 15:31, Dave Angel wrote:
> I didn't realize Windows shell (DOS box) had that bug.  Course I don't 
> use Windows much the last few years.
> 
> it's one thing to not display it properly.  It's quite another to supply 
> faulty data to the clipboard.  Especially since the Windows clipboard 
> has a separate Unicode type available.

It's because console bitmap fonts almost always (always?) only have one
codepage's worth of characters, and it's considered better to display A
for U+0100 than a blank space, and the clipboard has always been a bit
of an afterthought for the windows console. Meanwhile, a truetype font
is considered likely to have real glyphs for most characters a user
would want to display, so no conversion is done. And there's no font
rendering routine for bitmap fonts that will allow for dynamic
substitution of glyphs, so it becomes a real A (or whatever) in the
console buffer itself - this isn't a conversion done at clipboard-copy
time.
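
The substitution described above is lossy: once the console buffer
holds A, nothing records that the program wrote U+0100. The following
is only a rough analogy in Python 3, not the console's actual mapping.
It uses Unicode decomposition from the standard unicodedata module to
show how a character can collapse to its base letter and become
indistinguishable from it. The function name lossy_fit is invented for
this sketch.

```python
import unicodedata


def lossy_fit(text):
    """Decompose text, then drop whatever is not plain ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


original = "\u0100"            # LATIN CAPITAL LETTER A WITH MACRON
fitted = lossy_fit(original)
print(fitted, fitted == "A", fitted == original)
```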

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 9:51 AM, Gregory Ewing wrote:
> Marko Rauhamaa wrote:
>>
>> Unicode strings is not wrong but the technical emphasis on Unicode is as
>> strange as a "tire car" or "rectangular door" when "car" and "door" are
>> what you usually mean.
>
>
> The reason Unicode gets emphasised so much is that
> until relatively recently, it *wasn't* what "string"
> usually meant in Python.
>
> When Python 3 has been around for as long as Python
> 2 was, things may change.

I doubt it; the bytes() type is sufficiently stringy to require the
distinction to still be made. PEP 461 makes it clear that byte strings
are not blobs of opaque data, but are very definitely ASCII-compatible
objects, for the benefit of boundary code.
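
A minimal sketch of what that ASCII compatibility looks like in
boundary code, assuming Python 3.5 or later (the version in which the
PEP 461 %-formatting for bytes became available). The header being
built is just an example value.

```python
header = b"Content-Length: %d\r\n" % 42      # bytes %-formatting
name, _, value = header.strip().partition(b": ")

print(header)
print(name.lower(), int(value))               # ASCII-aware methods on bytes
```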

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Gregory Ewing


Marko Rauhamaa wrote:

> Unicode strings is not wrong but the technical emphasis on Unicode is as
> strange as a "tire car" or "rectangular door" when "car" and "door" are
> what you usually mean.


The reason Unicode gets emphasised so much is that
until relatively recently, it *wasn't* what "string"
usually meant in Python.

When Python 3 has been around for as long as Python
2 was, things may change.

--
Greg

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 7:31 AM, Dave Angel  wrote:
> On 11/23/2014 01:13 PM, random...@fastmail.us wrote:
>>
>> On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:
>>>
>>> Why would that be possible? Many truetype fonts only supply
>>> glyphs for
>>> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
>>> character map utility and see what some of the font files contain.
>>
>>
>> With a bitmap font selected, the characters will be immediately replaced
>> with characters present in the font's codepage, and will copy to
>> clipboard as such.
>
>
> I didn't realize Windows shell (DOS box) had that bug.  Course I don't use
> Windows much the last few years.

Likewise. I've been accustomed to copying and pasting unrecognized
characters (one of the easiest solutions is to paste them into a
Python console - ord() for one character, or a Py2 repr() for multiple
- to quickly see what the codepoints are), relying on the clipboard
getting the exact same sequence that was printed by the application.
Thanks, Windows, just what I always wanted to hear.
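
The paste-and-inspect trick can be sketched as follows, assuming
Python 3.3 or later, where ascii() plays the role of the Py2 repr().
The helper name codepoints and the sample text are made up for the
example.

```python
import unicodedata


def codepoints(text):
    """Return (U+XXXX, name) pairs for every character in text."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


pasted = "a\u20ac\u0153"   # stand-in for text pasted from the clipboard
for code, name in codepoints(pasted):
    print(code, name)
print(ascii(pasted))       # escaped form, safe to print on any console
```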

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Dave Angel


On 11/23/2014 01:13 PM, random...@fastmail.us wrote:

> On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:
>
>> Why would that be possible? Many truetype fonts only supply glyphs for
>> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
>> character map utility and see what some of the font files contain.
>
> With a bitmap font selected, the characters will be immediately replaced
> with characters present in the font's codepage, and will copy to
> clipboard as such.


I didn't realize Windows shell (DOS box) had that bug.  Course I don't 
use Windows much the last few years.


it's one thing to not display it properly.  It's quite another to supply 
faulty data to the clipboard.  Especially since the Windows clipboard 
has a separate Unicode type available.


--
DaveA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread random832

On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:
>   Why would that be possible? Many truetype fonts only supply glyphs for
> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
> character map utility and see what some of the font files contain.

With a bitmap font selected, the characters will be immediately replaced
with characters present in the font's codepage, and will copy to
clipboard as such.

With a truetype font (Lucida Console or Consolas) selected, the
characters will be displayed as replacement glyphs (box with a question
mark in it) if not present in the font, but *will still copy to the
clipboard as the original code point* (which you might notice is where
we started, with someone claiming success by being able to do so with
codepage 65001 selected). And in any case, all characters that *are* in
the font will work and display correctly, rather than only those in the
OEM codepage.
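
To make "only those in the OEM codepage" concrete, here is a small
sketch using Python's own codecs rather than the console itself. It
assumes Python 3 and takes cp437 as an example OEM codepage; utf-8
stands in for codepage 65001. The function name roundtrip is only
illustrative.

```python
def roundtrip(text, codepage):
    """Encode to a codepage, substituting '?' for what it lacks, and decode."""
    return text.encode(codepage, "replace").decode(codepage)


sample = "\u00e9\u20ac"                      # e-acute and the euro sign
print(ascii(roundtrip(sample, "cp437")))    # the euro sign does not survive
print(roundtrip(sample, "utf-8") == sample)
```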

>   Heck -- on my current machine, the True Type fonts are all old
> third-party items. All the standard fonts are now Open Type.

The win32 console's configuration UI refers to opentype fonts as
truetype. Opentype fonts can use either truetype or type 1 as the
underlying format, and all opentype fonts supplied with windows use
truetype. You are being excessively pedantic in objecting to my use of
the term "truetype".

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 3:33 AM, Dennis Lee Bieber wrote:
> On Sat, 22 Nov 2014 20:52:37 -0500, random...@fastmail.us declaimed the
> following:
>
>>On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
>>> ...
>>> That is a standard Windows build. He is again conflating problems with
>>> using the Windows command line for a given code page with the FSR.
>>
>>The thing is, with a truetype font selected, a correctly written win32
>>console program should be able to print any character without caring
>
> Why would that be possible? Many truetype fonts only supply glyphs for
> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
> character map utility and see what some of the font files contain.

A program should be able to print those characters even if they all
look identical. Chances are you can copy and paste them into something
else. But yes, finding a suitable font that covers the whole Unicode
range is *hard*. I've struggled with this one with a few programs (and
I still haven't managed to get VLC to satisfactorily display subtitles
that include Chinese characters).

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 5:17 PM, Steven D'Aprano wrote:
> If Python treated the character set as an implementation detail, the
> programmer would have no way of knowing whether
>
> s = u"ö"
>
> is legal or not, since you cannot know whether or not ö is a supported
> character in the running Python. It might work on your system, and fail for
> other people. That is worse than the old distinction between "narrow"
> and "wide" builds. It would be a lazy and stupid design, and especially
> stupid since there really is no good alternative to Unicode today. ASCII is
> not even sufficient for American English, the whole Windows code page idea
> is a horrible mess, none of the legacy encodings are suitable for more than
> a tiny fraction of the world.

(Code pages aren't a Windows concept, of course, though I guess that's
the main place where they're found on PCs today.)

The only trouble with enforcing Unicode is Japanese encodings and the
whole Han unification debate. Ultimately, you have to pick a side: are
you siding with those who say there are fewer characters with multiple
forms, or with those who say there are more distinct characters? If
the former, go with Unicode. If the latter, be prepared to do heaps of
work yourself, and probably be stuck with supporting only Japanese,
because encodings like Shift-JIS aren't going to be able to represent
Scandinavian text.
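
A quick way to see that limitation, assuming Python 3 and its bundled
shift_jis codec. The helper name can_encode and the two sample strings
are made up for this sketch.

```python
def can_encode(text, encoding):
    """True if every character of text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


japanese = "\u65e5\u672c\u8a9e"     # "Japanese language" in kanji
swedish = "sm\u00f6rg\u00e5s"       # contains o-umlaut and a-ring

for encoding in ("shift_jis", "utf-8"):
    print(encoding, can_encode(japanese, encoding), can_encode(swedish, encoding))
```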

Me, I'm siding with Unicode. The politicking of Han unification
doesn't interest me, so I'm happy to accept a position that says that
they're all the same character, just as the Roman letter A can be used
in English, Italian, German, Swedish, etc, etc, etc (maybe with some
combining characters for diacriticals). That gives me access to all
the world's languages with a single character set and some trustworthy
encodings. I think it's a fine trade-off: philosophy I don't care
about versus correctness in my code.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Steven D'Aprano

random...@fastmail.us wrote:

> On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
>> I really don't understand what bothers you about this. In Python, we have
>> Unicode strings and byte strings. In computing in general, strings can
>> consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
>> characters, ISO-8859-7 characters, and literally dozens of others. It
>> boggles my mind that you are so opposed to being explicit about what sort
>> of string we are dealing with.
> 
> I think he means that it should be implementation-defined with an API
> that does not allow programs to make assumptions about the encoding,
> like C. To allow for implementations that use a different character set.

Python is not C, and doesn't make every second thing undefined behaviour.

If Python treated the character set as an implementation detail, the
programmer would have no way of knowing whether

s = u"ö"

is legal or not, since you cannot know whether or not ö is a supported
character in the running Python. It might work on your system, and fail for
other people. That is worse than the old distinction between "narrow"
and "wide" builds. It would be a lazy and stupid design, and especially
stupid since there really is no good alternative to Unicode today. ASCII is
not even sufficient for American English, the whole Windows code page idea
is a horrible mess, none of the legacy encodings are suitable for more than
a tiny fraction of the world.
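
A short sketch of the point about the literal, written with an escape
so the example does not depend on the source file's encoding. It
assumes Python 3.3 or later for the printed forms; the string itself is
the same one-character value as the literal above.

```python
s = u"\u00f6"                      # same character as the literal above

print(len(s), hex(ord(s)))         # one code point, U+00F6
print(s.encode("utf-8"))           # two bytes
print(s.encode("latin-1"))         # one byte
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)                    # ASCII cannot represent it
```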

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Sat, Nov 22, 2014, at 21:11, Chris Angelico wrote:
> Is that true? Does WriteConsoleW support every Unicode character? It's
> not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe
> something else).

I was defining "every unicode character" loosely. There are certainly
display problems (there are display problems with wide characters on
non-CJK windows versions, too), but if you write a surrogate pair,
you'll get something that can copy to the clipboard as a surrogate pair,
and get the same thing that writing a non-BMP UTF-8 character with
codepage 65001 will give you. And you certainly won't get an error.
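
For reference, the relation between a non-BMP code point, its UTF-16
surrogate pair and its UTF-8 form can be worked out by hand. The sketch
assumes Python 3.3 or later and uses U+1F300, the emoji from the
interpreter session quoted elsewhere in this thread; the function name
surrogate_pair is only illustrative.

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


cyclone = "\U0001f300"
high, low = surrogate_pair(ord(cyclone))
print("UTF-16: %04X %04X" % (high, low))
print("UTF-8: ", cyclone.encode("utf-8"))
```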

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 12:52 PM,   wrote:
> On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
>> ...
>> That is a standard Windows build. He is again conflating problems with
>> using the Windows command line for a given code page with the FSR.
>
> The thing is, with a truetype font selected, a correctly written win32
> console program should be able to print any character without caring
> about codepages (via use of WriteConsoleW instead of WriteFile). You
> cannot rely on having the codepage set to 65001, especially since 65001
> isn't actually a fully supported codepage.

Is that true? Does WriteConsoleW support every Unicode character? It's
not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe
something else).

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
> ...
> That is a standard Windows build. He is again conflating problems with 
> using the Windows command line for a given code page with the FSR.

The thing is, with a truetype font selected, a correctly written win32
console program should be able to print any character without caring
about codepages (via use of WriteConsoleW instead of WriteFile). You
cannot rely on having the codepage set to 65001, especially since 65001
isn't actually a fully supported codepage.

In my opinion it is a deficiency in the win32 support, rather than
unicode support (and certainly nothing to do with the FSR), but it _is_
a deficiency.
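
A rough, untested-on-Windows sketch of the WriteConsoleW route, using
ctypes from Python 3. The names console_write and utf16_length are made
up here, and the fallback to sys.stdout (for non-Windows systems, or
when output is redirected and WriteConsoleW reports failure) is an
assumption of this example, not something the thread prescribes.

```python
import ctypes
import sys

STD_OUTPUT_HANDLE = -11  # documented constant for GetStdHandle


def utf16_length(text):
    """Number of UTF-16 code units WriteConsoleW would be asked to write."""
    return len(text.encode("utf-16-le")) // 2


def console_write(text):
    """Write via WriteConsoleW on a Windows console, else via sys.stdout."""
    if sys.platform == "win32":
        from ctypes import wintypes
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetStdHandle.restype = wintypes.HANDLE
        kernel32.WriteConsoleW.restype = wintypes.BOOL
        kernel32.WriteConsoleW.argtypes = [
            wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
            ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
        ]
        handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
        written = wintypes.DWORD(0)
        if kernel32.WriteConsoleW(handle, text, utf16_length(text),
                                  ctypes.byref(written), None):
            return "WriteConsoleW"
    sys.stdout.write(text)   # not a console, or not Windows
    return "sys.stdout"
```

The length passed is counted in UTF-16 code units, so a character
outside the BMP counts as two.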

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
> I really don't understand what bothers you about this. In Python, we have
> Unicode strings and byte strings. In computing in general, strings can
> consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
> characters, ISO-8859-7 characters, and literally dozens of others. It
> boggles my mind that you are so opposed to being explicit about what sort
> of string we are dealing with.

I think he means that it should be implementation-defined with an API
that does not allow programs to make assumptions about the encoding,
like C. To allow for implementations that use a different character set.

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence

On 22/11/2014 22:31, Chris Angelico wrote:

> On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence wrote:
>
>> My favourite "find thousand and one ways to make Python crashing or
>> failing." but I don't recall a single bug report in the last two years from
>> anybody regarding problems with the FSR, or have I missed something?
>
> What you've missed is the grammar of the sentence you've (partially)
> quoted. Clearly he is seeking to make Python, and he is crashing or
> failing. My advice to him: Stop trying to build complex software while
> in command of a car.
>
> ChrisA

What?  The entire message follows.

I think you are not understanding the point very well.

Py32 and Qt derivative + plenty of dirty tricks.
(It will probably not be rendered correctly.)

Write something like this (an interactive interpreter)
in Py32 and Py33 and see what happens:

>>> print(999)
999
>>> sys.version
'3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]'
>>> # note the emoji and the private use area (plane 15)
>>> a = 'abc\u00e9\u0153\u20ac\u1e9e\U0001f300\udb80\udc00z'
>>> print(a)
abcéœ€ẞ🌀󰀀z
>>>

Note: it can be "cut/copied/pasted" with a MS product.

jmf

PS I have to recognized, I'm slowly getting tired to
find thousand and one ways to make Python crashing
or failing.

That is a standard Windows build. He is again conflating problems with 
using the Windows command line for a given code page with the FSR.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence  wrote:
> My favourite "find thousand and one ways to make Python crashing or
> failing." but I don't recall a single bug report in the last two years from
> anybody regarding problems with the FSR, or have I missed something?

What you've missed is the grammar of the sentence you've (partially)
quoted. Clearly he is seeking to make Python, and he is crashing or
failing. My advice to him: Stop trying to build complex software while
in command of a car.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence


On 22/11/2014 20:17, Chris Angelico wrote:

> On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence wrote:
>
>> Please don't feed him.  Your average troll is bad enough but he really takes
>> the biscuit.
>
> ... someone was feeding him biscuits?
>
> ChrisA



Surely it's better than feeding him unicode?

As I needed cheering up I ventured over to gg and wasn't disappointed 
reading his latest rubbish. My favourite "find thousand and one ways to 
make Python crashing or failing." but I don't recall a single bug report 
in the last two years from anybody regarding problems with the FSR, or 
have I missed something?


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence  wrote:
> Please don't feed him.  Your average troll is bad enough but he really takes
> the biscuit.

... someone was feeding him biscuits?

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence


On 22/11/2014 17:49, Marko Rauhamaa wrote:

> wxjmfa...@gmail.com:
>
>> - By chance, I found on the web a German py dev who was commenting and
>> he had not an updated "DUDEN" (a German dictionnary).
>
> That... leaves me utterly speachless!
>
> Marko



Please don't feed him.  Your average troll is bad enough but he really 
takes the biscuit.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

wxjmfa...@gmail.com:

> - By chance, I found on the web a German py dev who was commenting and
> he had not an updated "DUDEN" (a German dictionnary).

That... leaves me utterly speachless!


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Rustom Mody

On Saturday, November 22, 2014 8:14:15 PM UTC+5:30, Roy Smith wrote:
>  Marko Rauhamaa wrote:
> 
> > Steven D'Aprano:
> > 
> > > You haven't given any good reason for objecting to calling Unicode
> > > strings by what they are. Maybe you think that it is an implementation
> > > detail, and that some version of Python might suddenly and without
> > > warning change to only supporting KOI8-R strings or GB2312 strings? If
> > > so, you are badly mistaken. The fact that Python strings are Unicode
> > > is not an implementation detail, it is part of the language semantics.
> > 
> > To me, repeating the word Unicode everywhere is giving the (in and of
> > itself impressive) standard too primary a status. While understanding
> > how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
> > useful and occasionally can be taken explicit advantage of, those really
> > are mundane techniques to implement abstractions.
> > 
> > Python's strings exist (primarily) so you can express utterances in a
> > human language, aka plain text. They don't exist to express Unicode code
> > points. That would be putting the cart before the horse.
> > 
> > > "Rectangular door" makes perfect sense, and in a world where there are
> > > dozens of legacy non-rectangular doors, it would be very sensible to
> > > specify the kind of door.
> > 
> > It makes sense, and yet, I've never heard anyone talk about rectangular
> > doors even though I use numerous doors every day. Why is it, then, that
> > people feel the constant need to add the "Unicode" epithet to Python's
> > strings, which -- according to its own specification -- are just
> > strings?
> > 
> > 
> > Marko
> 
> There's an old joke to the effect that the fields of study which are 
> confident that they're really doing science (i.e. chemistry, biology, 
> physics, astronomy, etc) don't put the word "science" in their names.  
> It's only the fields of study that are less confident about their status 
> as sciences (computer science, behavioral science, political science, 
> etc) that feel the need to explicitly say "science".  As if repeating it 
> enough times makes it true.  I wonder if something of the same thing 
> applies here?  
> 
> Somewhat more seriously, the IEEE-754 point is quite apropos.  Back when 
> 754 first came out, there were lots of different floating point 
> implementations.  Machines that used 754 touted it in their sales 
> literature and mentioned it all over their documentation.  These days, 
> 754 is so ubiquitous, nobody even thinks to mention it, in the same way 
> nobody bothers to mention 2's complement integers.  I suspect that some 
> day, the same thing will happen with Unicode.  For that matter, we will 
> eventually get to the point where when people say, "just plain text", 
> they will mean Unicode, in the same way that "just plain text" today 
> really means ASCII (and the text/plain MIME type will become a 
> historical curiosity).

Yes, this was my point also -- encodings in general, and Unicode in
particular, are a mess (as of 2014).  Maybe in a few years the dust
will settle.  Then saying 'unicode' will become redundant.
But until then, while we have a rather leaky abstraction, having
sealing liquid on the hands is preferable to sewage in the house.

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

Roy Smith :

> For that matter, we will eventually get to the point where when people
> say, "just plain text", they will mean Unicode, in the same way that
> "just plain text" today really means ASCII (and the text/plain MIME
> type will become a historical curiosity).

MIME has:

   Content-Type: text/plain; charset="UTF-8"

(even though UTF-8 isn't a character set but a content encoding).
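
For illustration, a minimal sketch (assuming Python 3.6+ and the standard
library's email.message.EmailMessage; the message text is made up) of how
that charset parameter ends up on a text/plain part:

```python
from email.message import EmailMessage

msg = EmailMessage()
# set_content() with a str body creates a text/plain part; the charset
# argument names the encoding used to turn the text into bytes.
msg.set_content("na\u00efve caf\u00e9", charset="utf-8")

content_type = msg.get_content_type()    # the media type only
charset = msg.get_content_charset()      # the charset parameter, lowercased
print(msg["Content-Type"])
```

The header parameter is called "charset", but the value it carries is the
name of an encoding.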


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Roy Smith

In article <87y4r348uf@elektro.pacujo.net>,
 Marko Rauhamaa  wrote:

> Steven D'Aprano :
> 
> > You haven't given any good reason for objecting to calling Unicode
> > strings by what they are. Maybe you think that it is an implementation
> > detail, and that some version of Python might suddenly and without
> > warning change to only supporting KOI8-R strings or GB2312 strings? If
> > so, you are badly mistaken. The fact that Python strings are Unicode
> > is not an implementation detail, it is part of the language semantics.
> 
> To me, repeating the word Unicode everywhere is giving the (in and of
> itself impressive) standard too primary a status. While understanding
> how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
> useful and occasionally can be taken explicit advantage of, those really
> are mundane techniques to implement abstractions.
> 
> Python's strings exist (primarily) so you can express utterances in a
> human language, aka plain text. They don't exist to express Unicode code
> points. That would be putting the cart before the horse.
> 
> > "Rectangular door" makes perfect sense, and in a world where there are
> > dozens of legacy non-rectangular doors, it would be very sensible to
> > specify the kind of door.
> 
> It makes sense, and yet, I've never heard anyone talk about rectangular
> doors even though I use numerous doors every day. Why is it, then, that
> people feel the constant need to add the "Unicode" epithet to Python's
> strings, which -- according to its own specification -- are just
> strings?
> 
> 
> Marko

There's an old joke to the effect that the fields of study which are 
confident that they're really doing science (e.g. chemistry, biology, 
physics, astronomy, etc) don't put the word "science" in their names.  
It's only the fields of study that are less confident about their status 
as sciences (computer science, behavioral science, political science, 
etc) that feel the need to explicitly say "science".  As if repeating it 
enough times makes it true.  I wonder if something of the same thing 
applies here?  

Somewhat more seriously, the IEEE-754 point is quite apropos.  Back when 
754 first came out, there were lots of different floating point 
implementations.  Machines that used 754 touted it in their sales 
literature and mentioned it all over their documentation.  These days, 
754 is so ubiquitous, nobody even thinks to mention it, in the same way 
nobody bothers to mention 2's complement integers.  I suspect that some 
day, the same thing will happen with Unicode.  For that matter, we will 
eventually get to the point where when people say, "just plain text", 
they will mean Unicode, in the same way that "just plain text" today 
really means ASCII (and the text/plain MIME type will become a 
historical curiosity).

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

Steven D'Aprano :

> You haven't given any good reason for objecting to calling Unicode
> strings by what they are. Maybe you think that it is an implementation
> detail, and that some version of Python might suddenly and without
> warning change to only supporting KOI8-R strings or GB2312 strings? If
> so, you are badly mistaken. The fact that Python strings are Unicode
> is not an implementation detail, it is part of the language semantics.

To me, repeating the word Unicode everywhere is giving the (in and of
itself impressive) standard too primary a status. While understanding
how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
useful and occasionally can be taken explicit advantage of, those really
are mundane techniques to implement abstractions.

Python's strings exist (primarily) so you can express utterances in a
human language, aka plain text. They don't exist to express Unicode code
points. That would be putting the cart before the horse.

> "Rectangular door" makes perfect sense, and in a world where there are
> dozens of legacy non-rectangular doors, it would be very sensible to
> specify the kind of door.

It makes sense, and yet, I've never heard anyone talk about rectangular
doors even though I use numerous doors every day. Why is it, then, that
people feel the constant need to add the "Unicode" epithet to Python's
strings, which -- according to its own specification -- are just
strings?


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 12:50 AM, Steven D'Aprano
 wrote:
> "Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a
> world where there are dozens of legacy non-rectangular doors, it would be
> very sensible to specify the kind of door. Just as we specify sliding door,
> glass door, security door, fire door, flyscreen wire door, and so on.

Not just legacy - scifi often has non-rectangular doors. (And they're
often HORRIBLY impractical. I think the rectangular door is here to
stay.) But English is a strange beast. A glass door is made of
glass... a flyscreen wire door is made of (at least, has a significant
component of) flyscreen, but a fire door isn't made of fire...

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Steven D'Aprano

Marko Rauhamaa wrote:

> Steven D'Aprano :
> 
>> In Python, we have Unicode strings and byte strings.
> 
> No, you don't. You have strings and bytes:

Python has strings of Unicode code points, a.k.a. "Unicode strings",
or "text strings", and strings of bytes, a.k.a. "byte strings". These are
the plain English descriptive names of the types "str" and "bytes". 

>   Textual data in Python is handled with str objects, or strings.
>   Strings are immutable sequences of Unicode code points. String
>   literals are written in a variety of ways: [...]

Hence, Unicode string.

>   https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
> 
>   The core built-in types for manipulating binary data are bytes and
>   bytearray.

Which are strings of bytes.

>   https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview
> 
> 
> Equivalently, I wouldn't mind "character strings" vs "byte strings".

Unicode strings are not strings of characters, except informally. Some code
points represent non-characters:

http://www.unicode.org/faq/private_use.html#nonchar1

They are strings of Unicode code points, but "code point string" is firstly
an inelegant and ugly phrase, and secondly ambiguous. What sort of code
points? Baudot codes? ASCII codes? Big5 codes? Tron codes? No, none of the
above, they are *Unicode* code points.
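
A minimal sketch of that distinction, assuming Python 3 (the variable
names are only illustrative): the str holds Unicode code points, and the
bytes you get depend on which encoding you ask for.

```python
import unicodedata

s = "\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}"

code_points = [ord(c) for c in s]     # what the str holds, abstractly
utf8_bytes = s.encode("utf-8")        # one possible byte encoding
latin1_bytes = s.encode("latin-1")    # a different byte encoding

# U+FFFE is one of the noncharacters: a valid code point that is never
# assigned to a character, so its general category is "Cn".
nonchar_category = unicodedata.category("\ufffe")
print(code_points, utf8_bytes, latin1_bytes, nonchar_category)
```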

You haven't given any good reason for objecting to calling Unicode strings
by what they are. Maybe you think that it is an implementation detail, and
that some version of Python might suddenly and without warning change to
only supporting KOI8-R strings or GB2312 strings? If so, you are badly
mistaken. The fact that Python strings are Unicode is not an implementation
detail, it is part of the language semantics.

> Unicode strings is not wrong but the technical emphasis on Unicode is as
> strange as a "tire car" or "rectangular door" when "car" and "door" are
> what you usually mean.

"Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a
world where there are dozens of legacy non-rectangular doors, it would be
very sensible to specify the kind of door. Just as we specify sliding door,
glass door, security door, fire door, flyscreen wire door, and so on.

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Steven D'Aprano :

> In Python, we have Unicode strings and byte strings.

No, you don't. You have strings and bytes:

  Textual data in Python is handled with str objects, or strings.
  Strings are immutable sequences of Unicode code points. String
  literals are written in a variety of ways: [...]

  https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

  The core built-in types for manipulating binary data are bytes and bytearray.

  https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview


Equivalently, I wouldn't mind "character strings" vs "byte strings".
Unicode strings is not wrong but the technical emphasis on Unicode is as
strange as a "tire car" or "rectangular door" when "car" and "door" are
what you usually mean.


Marko

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Steven D'Aprano

Marko Rauhamaa wrote:

> Rustom Mody :
> 
>> Likewise in 2014, and given the arguments, inconsistencies, etc
>> remembering the nuts-n-bolts below the strings-represented-as-unicode
>> abstraction may be in order.
> 
> No need to hide Unicode, but talking about a
> 
>    Unicode string
> 
> is like talking about an
> 
>    electronic computer

versus a hydraulic computer, a mechanical computer, an optical computer, a
human computer, a genetic (DNA) computer, ... 

>    visible spectrum display

I'm not sure that many people actually do refer to "visible spectrum
display", or what you mean by it, but I can easily imagine that being in
contrast with a non-visible spectrum display.

>    mouse user interface

As opposed to a commandline user interface, direct brain-to-computer user
interface, touch UI, etc. Not to mention non-user interfaces, like SCSI
interface, SATA interface, USB interface, ...

>    ethernet socket

Telephone socket, Appletalk socket, Firewire socket, ADB socket ...

>    magnetic file

I have no idea what you mean here. Do you mean magnetic *field*? As opposed
to an electric field, gravitational field, Higgs field, strong nuclear
force field, weak nuclear force field ...

>    electric power supply
> 
> The language spec calls the things just "strings," as it should.

I really don't understand what bothers you about this. In Python, we have
Unicode strings and byte strings. In computing in general, strings can
consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
characters, ISO-8859-7 characters, and literally dozens of others. It
boggles my mind that you are so opposed to being explicit about what sort
of string we are dealing with.

Are you equally disturbed when people distinguish between tablespoon,
teaspoon, dessert spoon and serving spoon?

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 3:36 AM, Marko Rauhamaa  wrote:
> No need to hide Unicode, but talking about a
>
>    Unicode string
>
> is like talking about an
>
>    electronic computer
>
>    visible spectrum display
>
>    mouse user interface
>
>    ethernet socket
>
>    magnetic file
>
>    electric power supply
>
> The language spec calls the things just "strings," as it should.

I'm not sure what you mean here, because the adjectives all cut out
other common constructs - a byte string, an analog computer, an IR or
UV display, a blind-compatible UI, a Unix domain socket, an in-memory
file, and a diesel power supply. Okay, I'm pushing it with the last
one (they're usually called gen sets, not power supplies), and I don't
often hear people talk about "magnetic files", but the rest are
definitely valid comparison/contrast terms.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Rustom Mody :

> Likewise in 2014, and given the arguments, inconsistencies, etc
> remembering the nuts-n-bolts below the strings-represented-as-unicode
> abstraction may be in order.

No need to hide Unicode, but talking about a

   Unicode string

is like talking about an

   electronic computer

   visible spectrum display

   mouse user interface

   ethernet socket

   magnetic file

   electric power supply

The language spec calls the things just "strings," as it should.


Marko

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 3:11 AM, Francis Moreau  wrote:
> Yes, I finally used str(), since only setlocale() turned out to have
> issues with unicode_literals active in my application.
>
> Thanks Chris for your useful insight.

My pleasure. Unicode is a bit of a hobby-horse of mine, so I'm always
happy to see people getting things right :)

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Francis Moreau

On 11/20/2014 04:15 PM, Chris Angelico wrote:
> On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau  
> wrote:
>> Hi,
>>
>> Thanks for the "from __future__ import unicode_literals" trick, it makes
>> that switch much less intrusive.
>>
>> However it seems that I will suddenly be trapped by all modules which
>> are not prepared to handle unicode. For example:
>>
>>  >>> from __future__ import unicode_literals
>>  >>> import locale
>>  >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
>>  Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>>    File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
>>      locale = normalize(_build_localename(locale))
>>    File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
>>      language, encoding = localetuple
>>  ValueError: too many values to unpack
>>
>> Is the locale module an exception and in that case I'll fix it by doing:
>>
>>  >>> locale.setlocale(locale.LC_ALL, b'fr_FR')
>>
>> or is a (big) part of the modules in python 2.7 still not ready for
>> unicode and in that case I have to decide which prefix (u or b) I should
>> manually add ?
> 
> Sadly, there are quite a lot of parts of Python 2 that simply don't
> handle Unicode strings. But you can probably keep all of those down to
> just a handful of explicit b"whatever" strings; most places should
> accept unicode as well as str. What you're seeing here is a prime
> example of one of this author's points (caution, long post):
> 
> http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
> 
> """The lesson of Python 3 is: give programmers a Unicode string type,
> *make it the default*, and encoding issues will /mostly/ go away."""
> 
> There's a whole ecosystem to Python 2 - some in the standard library,
> heaps more in the rest of the world - and a lot of it was written on
> the assumption that a byte is a character is an octet. When you pass
> Unicode strings to functions written to expect byte strings, sometimes
> you win, and sometimes you lose... even with the standard library
> itself. But the Python 3 ecosystem has been written on the assumption
> that strings are Unicode. It's only a narrow set of programs
> ("boundary code", where you're moving text across networks and stuff
> like that) where the Python 2 model is easier to work with; and the
> recent Py3 releases have been progressively working to relieve that
> pain.
> 
> The absolute worst case is a function which exists in Python 2 and 3,
> and requires a byte string in Py2 and a text string in Py3. Sadly,
> that may be exactly what locale.setlocale() is. For that, I would
> suggest explicitly passing stuff through str():
> 
> locale.setlocale(locale.LC_ALL, str('fr_FR'))
> 
> In Python 3, 'fr_FR' is already a str, so passing it through str()
> will have no significant effect. (Though it would be worth commenting
> that, to make it clear to a subsequent reader that this is Py2 compat
> code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode,
> so passing it through str() will encode it to ASCII, producing a byte
> string that setlocale should be happy with.
> 
> By the way, the reason for the strange error message is clearer in
> Python 3, which chains in another exception:
> 
> >>> locale.setlocale(locale.LC_ALL, b'fr_FR')
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
>     language, encoding = localetuple
> ValueError: too many values to unpack (expected 2)
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
>     locale = normalize(_build_localename(locale))
>   File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
>     raise TypeError('Locale must be None, a string, or an iterable of
> two strings -- language code, encoding.')
> TypeError: Locale must be None, a string, or an iterable of two
> strings -- language code, encoding.
> 
> So when it gets the wrong type of string, it attempts to unpack it as
> an iterable; it yields five values (the five bytes or characters,
> depending on which way it's the wrong type of string), but it's
> expecting two. Fortunately, str() will deal with this. But make sure
> you don't have the b prefix, or str() in Py3 will give you quite a
> different result!
> 

Yes, I finally used str(), since only setlocale() turned out to have
issues with unicode_literals active in my application.

Thanks Chris for your useful insight.
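
For the archive, a rough sketch of the two points quoted above, assuming
Python 3 (unpack_locale is a hypothetical stand-in for the two-name
unpacking in the traceback, not the real locale code): a five-item
sequence cannot be unpacked into two names, and str() treats text and
bytes very differently.

```python
def unpack_locale(localetuple):
    # Hypothetical stand-in for the two-name unpacking shown in the
    # traceback; it is not the real locale._build_localename.
    language, encoding = localetuple
    return language, encoding

try:
    unpack_locale(b'fr_FR')      # iterating five bytes yields five values
    error = None
except ValueError as exc:
    error = str(exc)

as_text = str('fr_FR')           # already text on Python 3: unchanged
as_repr = str(b'fr_FR')          # the repr of the bytes object, not a decode
print(error, as_text, as_repr)
```

On Python 2 with unicode_literals active, str('fr_FR') would instead
encode the unicode literal to an ASCII byte string, which is what
setlocale() wants there.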


Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Rustom Mody

On Friday, November 21, 2014 12:06:54 PM UTC+5:30, Marko Rauhamaa wrote:
> Chris Angelico :
> 
> > On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa  wrote:
> >> I don't really like it how Unicode is equated with text, or even
> >> character strings.
> > [...]
> > Do you have actual text that you're unable to represent in Unicode?
> 
> Not my point at all.
> 
> I'm saying equating an abstract data type (string) with its
> representation (Unicode vector) is bad taste.
> 
> > We don't call numbers IEEE,
> 
> Exactly.
> 
> > Do you genuinely have text that you can't represent in Unicode, or are
> > you just arguing against Unicode to try to justify "Python strings are
> > " as a basis for your code?
> 
> Nobody is arguing against Unicode. I'm saying, let's talk about the
> forest instead of the trees (except when the trees really are the
> focus).

I've always felt the makers of C showed remarkably good taste in 
the names 'int' and 'float'. Unlike:
Pascal: Integer and Real
PL/1: Fixed and Float

IOW the name is an explicit reminder of the leakier abstraction used for real numbers.

Likewise in 2014, and given the arguments, inconsistencies, etc
remembering the nuts-n-bolts below the strings-represented-as-unicode
abstraction may be in order.

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Tim Chase

On 2014-11-22 02:23, Steven D'Aprano wrote:
> LATIN SMALL LETTER E
> COMBINING CIRCUMFLEX ACCENT
> 
> then my application should treat that as a single "character" and
> display it as:
> 
> LATIN SMALL LETTER E WITH CIRCUMFLEX
> 
> which looks like this: ê
> 
> rather than two distinct "characters" eˆ
> 
> Now, that specific example is a no-brainer, because the Unicode
> normalization routines will handle the conversion. But not every
> combination of accented characters has a canonical combined form.
> What about something like this?
> 
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
> 
> If I insert a character into my string, I want to be able to insert
> before the w or after the caron, but not in the middle of those
> three code points.

Things get even weirder if you have

 '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}\N{COMBINING CARON}'

and when you try to do comparisons like

 s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
 s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
 s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'
 print(s1 == s2)
 print(s1 == s3)
 print(s2 == s3)

Then you also have the case where you want to edit text and the user
wants to remove the COMBINING OGONEK from the character, so you *do*
want to do something akin to

 s4 = ''.join(c for c in s3 if c != '\N{COMBINING OGONEK}')

And yet, weird things happen if you try to remove the circumflex:

  for test in (s1, s2, s3):
      print(test == ''.join(
          c for c in test if c != '\N{COMBINING CIRCUMFLEX ACCENT}'
          ))

They all make sense if you understand what's going on under the hood,
but from a visual/conceptual perspective, something feels amiss.
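
A small sketch of what is going on under the hood, assuming Python 3 and
the standard unicodedata module (canon is just an illustrative helper
name): the three spellings differ code point by code point, but they are
canonically equivalent, so they compare equal once normalized.

```python
import unicodedata

s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'

# Comparing the raw strings compares code point sequences.
raw_equal = (s1 == s2, s1 == s3, s2 == s3)

def canon(s):
    # NFD decomposes precomposed characters and puts the combining
    # marks into canonical order.
    return unicodedata.normalize('NFD', s)

nfd_equal = (canon(s1) == canon(s2),
             canon(s1) == canon(s3),
             canon(s2) == canon(s3))

# After decomposition the circumflex is always a separate code point,
# so filtering it out behaves the same for all three spellings.
stripped = ''.join(c for c in canon(s1)
                   if c != '\N{COMBINING CIRCUMFLEX ACCENT}')
print(raw_equal, nfd_equal, ascii(stripped))
```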

-tkc





Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 2:23 AM, Steven D'Aprano
 wrote:
> Chris Angelico wrote:
>
>> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
>>  wrote:
>>> (E.g. there are millions of existing files across the world containing
>>> text which use legacy encodings that are not compatible with Unicode.)
>>
>> Not compatible with Unicode? There aren't many character sets out
>> there that include characters not in Unicode - that was the whole
>> point. Of course, there are plenty of files in unspecified eight-bit
>> encodings, so you may have a problem with reliable decoding - but if
>> you know what the encoding is, you ought to be able to represent each
>> character in Unicode.
>
> What I meant was that for some encodings -- namely ASCII and Latin-1 --
> the ordinals are exactly equivalent to Unicode, that is:
>
> That's not quite as significant as I thought, though. What is significant is
> that a pure ASCII file on disk can be read by a program assuming UTF-8:
>
> although the same is not the case for Latin-1 encoded files.

Yep. Thing is, Unicode can't magically convert all files on all
disks... but with a good codec library, you can at least convert
things as you find them. (I was reading MacRoman files earlier this
year. THAT is an encoding I didn't expect I'd find in 2014.)

> Well, yes. My point, agreeing with Marko, is that any time you want to do
> something even vaguely related to human-readable text, "code points" are
> not enough. ... What about something like this?
>
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
>
> If I insert a character into my string, I want to be able to insert before
> the w or after the caron, but not in the middle of those three code points.

Yes, which is a concern. Also a concern is the ability to detect other
boundaries, like words. None of these can be easily solved; all of
them can be dealt with by using the Unicode character data, which is
better than you get for most legacy encodings. In terms of Python
strings, it still makes sense to insert characters between those
combining characters; so what you're saying is that a text editor
widget needs to be aware of more than just code points. Which is
trivially obvious in the presence of RTL text, too; cursor positions
through differing-direction text will be an issue.

The problems you're citing aren't Unicode problems. They stem from the
complexities of human languages. Unicode just makes them a bit more
visible to English-only-speaking programmers.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Steven D'Aprano

Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
>  wrote:
>> (E.g. there are millions of existing files across the world containing
>> text which use legacy encodings that are not compatible with Unicode.)
> 
> Not compatible with Unicode? There aren't many character sets out
> there that include characters not in Unicode - that was the whole
> point. Of course, there are plenty of files in unspecified eight-bit
> encodings, so you may have a problem with reliable decoding - but if
> you know what the encoding is, you ought to be able to represent each
> character in Unicode.

What I meant was that for some encodings -- namely ASCII and Latin-1 --
the ordinals are exactly equivalent to Unicode, that is:

# Python 3
for i in range(128):
    assert chr(i).encode('ASCII') == bytes([i])

for i in range(256):
    assert chr(i).encode('Latin-1') == bytes([i])

That's not quite as significant as I thought, though. What is significant is
that a pure ASCII file on disk can be read by a program assuming UTF-8:

for i in range(128):
    assert chr(i).encode('UTF-8') == bytes([i])

although the same is not the case for Latin-1 encoded files.
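
To make the Latin-1 case concrete, here is a minimal sketch (Python 3;
the sample text is made up): pure ASCII bytes decode identically as
UTF-8, while a Latin-1 byte above 0x7F is not valid UTF-8 on its own.

```python
ascii_bytes = bytes(range(128))
same_text = ascii_bytes.decode('utf-8') == ascii_bytes.decode('ascii')

latin1_bytes = 'caf\u00e9'.encode('latin-1')    # b'caf\xe9'
try:
    latin1_bytes.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    # 0xE9 announces a multi-byte UTF-8 sequence, but no continuation
    # bytes follow it.
    decoded_ok = False
print(same_text, decoded_ok)
```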

> Not compatible with any of the UTFs, that's different. Plenty of that
> in the world.
> 
>> You are certainly correct that in it's full generality, "text" is much
>> more than just a string of code points. Unicode strings is a primitive
>> data type. A powerful and sophisticated text processing application may
>> even find Python strings too primitive, possibly needing something like
>> ropes of graphemes rather than strings of code points.
> 
> That's probably more an efficiency point, though. It should be
> possible to do a perfect two-way translation between your grapheme
> rope and a Python string; otherwise, you'll have great difficulty
> saving your file to the disk (which will normally involve representing
> the text in Unicode, then encoding that to bytes).

Well, yes. My point, agreeing with Marko, is that any time you want to do
something even vaguely related to human-readable text, "code points" are
not enough. For example, if I give a string containing the following two
code points in this order:

LATIN SMALL LETTER E
COMBINING CIRCUMFLEX ACCENT

then my application should treat that as a single "character" and display it
as:

LATIN SMALL LETTER E WITH CIRCUMFLEX

which looks like this: ê

rather than two distinct "characters" eˆ

Now, that specific example is a no-brainer, because the Unicode
normalization routines will handle the conversion. But not every
combination of accented characters has a canonical combined form. What
about something like this?

'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'

If I insert a character into my string, I want to be able to insert before
the w or after the caron, but not in the middle of those three code points.
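
A rough sketch of the kind of grouping I mean, assuming Python 3.
Note that clusters() is a hypothetical helper, and it is only an
approximation: it glues combining marks onto the preceding code point
and does not implement the full Unicode grapheme cluster rules.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks after it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

s = 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}x'
parts = clusters(s)

# Offsets at which inserting text does not split a cluster.
offsets = [0]
for part in parts:
    offsets.append(offsets[-1] + len(part))
print(len(s), len(parts), offsets)
```

Here the string has five code points but only two clusters, and the
only safe insertion points are before the w, before the x, and at the
end.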

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Fri, Nov 21, 2014 at 7:16 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> Then you need to read more about Unicode. The *codepoint* for the
>> letter 'A' is 65. That is not Unicode, that is one part of the Unicode
>> spec.
>
> I don't think Python users need to know anything more about Unicode than
> they need to know about IEEE-754.
>
> How many bits are reserved for the mantissa? I don't remember and I
> don't see why I should care.

At what point can a Python float no longer represent every integer?
That's why you should care.
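
A quick sketch of the answer, assuming the usual case where Python's
float is an IEEE-754 double (the variable names are illustrative):

```python
import sys

mantissa_bits = sys.float_info.mant_dig    # 53 for IEEE-754 doubles
limit = 2 ** mantissa_bits

# Every integer up to the limit converts to float exactly...
exact_below = float(limit - 1) == limit - 1
# ...but limit + 1 has no float of its own and rounds back to limit.
collides = float(limit) == float(limit + 1)
print(mantissa_bits, limit, exact_below, collides)
```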

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Chris Angelico :

> Then you need to read more about Unicode. The *codepoint* for the
> letter 'A' is 65. That is not Unicode, that is one part of the Unicode
> spec.

I don't think Python users need to know anything more about Unicode than
they need to know about IEEE-754.

How many bits are reserved for the mantissa? I don't remember and I
don't see why I should care.


Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 6:14 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa  wrote:
>>> I'm saying equating an abstract data type (string) with its
>>> representation (Unicode vector) is bad taste.
>>
>> What about "sequence of Unicode code points" is "representation"? What
>> is your abstraction over that?
>
> The letter 'A' is a character. Unicode for the letter 'A' is 65. It is
> very rarely that you care about that number. You are only interested in
> the letter 'A', which you can use to spell people's names, for instance.
>
> When you read a book, you read the text, not the ink.

Then you need to read more about Unicode. The *codepoint* for the
letter 'A' is 65. That is not Unicode, that is one part of the Unicode
spec.
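
To illustrate, a minimal Python 3 sketch: the code point is a single
abstract number, and how it turns into bytes depends on the encoding
chosen.

```python
code_point = ord('A')            # the abstract code point number
round_trip = chr(code_point)     # and back to the one-character string

# The same code point under three different encoding forms.
encoded = {name: 'A'.encode(name)
           for name in ('utf-8', 'utf-16-be', 'utf-32-be')}
print(code_point, round_trip, encoded)
```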

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Chris Angelico :

> On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa  wrote:
>> I'm saying equating an abstract data type (string) with its
>> representation (Unicode vector) is bad taste.
>
> What about "sequence of Unicode code points" is "representation"? What
> is your abstraction over that?

The letter 'A' is a character. Unicode for the letter 'A' is 65. It is
very rarely that you care about that number. You are only interested in
the letter 'A', which you can use to spell people's names, for instance.

When you read a book, you read the text, not the ink.

Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa  wrote:
>>> I don't really like it how Unicode is equated with text, or even
>>> character strings.
>> [...]
>> Do you have actual text that you're unable to represent in Unicode?
>
> Not my point at all.
>
> I'm saying equating an abstract data type (string) with its
> representation (Unicode vector) is bad taste.

What about "sequence of Unicode code points" is "representation"? What
is your abstraction over that?

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Chris Angelico :

> On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa  wrote:
>> I don't really like it how Unicode is equated with text, or even
>> character strings.
> [...]
> Do you have actual text that you're unable to represent in Unicode?

Not my point at all.

I'm saying equating an abstract data type (string) with its
representation (Unicode vector) is bad taste.

> We don't call numbers IEEE,

Exactly.

> Do you genuinely have text that you can't represent in Unicode, or are
> you just arguing against Unicode to try to justify "Python strings are
> " as a basis for your code?

Nobody is arguing against Unicode. I'm saying, let's talk about the
forest instead of the trees (except when the trees really are the
focus).

Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 12:31 PM,   wrote:
> On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
>> 2) Languages which use a different alphabet (eg Cyrillic - Russian,
>> Bulgarian). You could possibly cram them into an eight-bit encoding
>> without tipping ASCII out, but I'm not sure. In Unicode, these
>> languages are all easily supported by the BMP, as they don't use a
>> huge number of characters each.
>
> There are numerous eight-bit encodings that support latin and one other
> alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
> encoding is basically two seven-bit encodings.

I'm aware of this; Greek, for instance, fits quite happily into
ISO-8859-7, which is eight-bit.

> The most difficult (of those still possible at all) language to encode
> in eight bits is actually Vietnamese, which uses the Latin alphabet, due
> to the sheer number of accented letters used. Windows' encoding of it
> (along with some other lesser used encodings, all for Vietnamese) is the
> only 8-bit encoding to use combining accents, in a way unfortunately
> incompatible with unicode normalization if naively translated, whereas
> VISCII sacrifices a handful of C0 control characters in addition to
> fully packing the high half with letters.

This is what I was suspicious of. The very notion of "combining
accents" already breaks the notion that "a byte is a character is a
glyph", which most eight-bit encodings try to pretend. In any case,
the BMP still easily copes with them all.

(Hmm. I wonder how you'd typeset the old "Self-Pronouncing Alphabet"
for English? It's basically English text with a few markings added to
letters - not standard diacriticals that already exist in Unicode, but
dots. Probably possible, one way or another... but I haven't seen SPA
text since the 90s, and that was in stuff published back in the 80s or
so.)

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
> 2) Languages which use a different alphabet (eg Cyrillic - Russian,
> Bulgarian). You could possibly cram them into an eight-bit encoding
> without tipping ASCII out, but I'm not sure. In Unicode, these
> languages are all easily supported by the BMP, as they don't use a
> huge number of characters each.

There are numerous eight-bit encodings that support latin and one other
alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
encoding is basically two seven-bit encodings.

The most difficult (of those still possible at all) language to encode
in eight bits is actually Vietnamese, which uses the Latin alphabet, due
to the sheer number of accented letters used. Windows' encoding of it
(along with some other lesser used encodings, all for Vietnamese) is the
only 8-bit encoding to use combining accents, in a way unfortunately
incompatible with unicode normalization if naively translated, whereas
VISCII sacrifices a handful of C0 control characters in addition to
fully packing the high half with letters.
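
The normalization point can be made concrete with a small sketch. It assumes Python 3 (the thread is about 2.7), the standard `unicodedata` module, and Python's `cp1258` codec for the Windows Vietnamese code page. The sample letter is an arbitrary choice for illustration, not one taken from the thread.

```python
import unicodedata

# U+1EC7: Latin small letter e with circumflex and dot below (used in Vietnamese).
precomposed = "\u1ec7"

# Windows-1258 style spelling: precomposed "e with circumflex" (U+00EA)
# followed by a separate combining dot below (U+0323).
windows_style = "\u00ea\u0323"

nfc = unicodedata.normalize("NFC", windows_style)
nfd = unicodedata.normalize("NFD", windows_style)

# The naively translated sequence is in neither normalization form.
is_normalized = windows_style in (nfc, nfd)

# The two-code-point spelling survives a round trip through the code page.
encoded = windows_style.encode("cp1258")
round_tripped = encoded.decode("cp1258")

# The fully precomposed letter has no slot of its own in the code page.
try:
    precomposed.encode("cp1258")
    precomposed_fits = True
except UnicodeEncodeError:
    precomposed_fits = False

print(len(nfc), len(nfd), is_normalized, len(encoded), precomposed_fits)
```

So text decoded from such a file compares unequal to NFC or NFD text until it is explicitly normalized.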

-- 
Random832

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
 wrote:
> (E.g. there are millions of existing files across the world containing text
> which use legacy encodings that are not compatible with Unicode.)

Not compatible with Unicode? There aren't many character sets out
there that include characters not in Unicode - that was the whole
point. Of course, there are plenty of files in unspecified eight-bit
encodings, so you may have a problem with reliable decoding - but if
you know what the encoding is, you ought to be able to represent each
character in Unicode.
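
As a rough illustration of that point (a Python 3 sketch; the sample word and the choice of the KOI8-R and Windows-1251 codecs are assumptions made for the example), the same text stored in two legacy encodings decodes to identical Unicode once the encoding is known, and silently turns into something else when it is guessed wrong:

```python
text = "Привет"

# Two legacy eight-bit encodings produce different bytes for the same text.
koi8_bytes = text.encode("koi8-r")
cp1251_bytes = text.encode("cp1251")

# Decoding with the matching codec recovers the same code points.
from_koi8 = koi8_bytes.decode("koi8-r")
from_cp1251 = cp1251_bytes.decode("cp1251")

# Decoding with the wrong codec raises no error here; it just yields wrong text.
mojibake = koi8_bytes.decode("cp1251")

print(from_koi8 == from_cp1251, mojibake == text)
```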

Not compatible with any of the UTFs, that's different. Plenty of that
in the world.

> You are certainly correct that in its full generality, "text" is much more
> than just a string of code points. A Unicode string is a primitive data
> type. A powerful and sophisticated text processing application may even
> find Python strings too primitive, possibly needing something like ropes of
> graphemes rather than strings of code points.

That's probably more an efficiency point, though. It should be
possible to do a perfect two-way translation between your grapheme
rope and a Python string; otherwise, you'll have great difficulty
saving your file to the disk (which will normally involve representing
the text in Unicode, then encoding that to bytes).

To be sure, a Python string is a poor representational form for a text
editor. But that's largely because it's immutable, so every little
edit would involve massive copying. Depending on what you're doing, it
might be worth using a chunked UTF-8 byte stream (allowing for
insertion at any chunk boundary), or an array of lines, or something
grapheme-based... but all of those questions are performance, not
correctness, issues.
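
A minimal sketch of the "array of lines" idea, assuming Python 3. The `LineBuffer` class and its method names are hypothetical, invented for this illustration. It only handles insertions that contain no newline, and it converts back to an ordinary string without loss.

```python
class LineBuffer:
    """Toy editor buffer: a list of lines instead of one immutable str."""

    def __init__(self, text=""):
        self.lines = text.split("\n")

    def insert(self, row, col, new_text):
        # Only the edited line is rebuilt; the other lines are untouched.
        # Assumes new_text contains no newline characters.
        line = self.lines[row]
        self.lines[row] = line[:col] + new_text + line[col:]

    def to_text(self):
        # Lossless conversion back to a single Python string.
        return "\n".join(self.lines)


buf = LineBuffer("Let it go\nlet it go!")
buf.insert(0, 9, ",")
print(buf.to_text())
```

The representation changes how much copying an edit costs, not what text can be represented.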

> We Western and Northern European speakers -- and I don't know whether Finns
> are counted as Northern Europeans or Eastern Europeans -- are lucky in that
> our natural languages are well-covered by Unicode. All our graphemes are
> also code points, even the "funny ones with accents". As an English
> speaker, I have to remind myself that not every grapheme is a single code
> point, but Devanagari or Navajo writers will never make that mistake.

I've been working with different languages a bit, lately. Broadly
speaking, you have:

1) Languages which use the Roman alphabet, plus a handful of other
characters (eg Finnish, German). These can be represented largely in
ASCII, and used to be handled fairly easily with a single codepage -
an eight-bit ASCII-compatible encoding.

2) Languages which use a different alphabet (eg Cyrillic - Russian,
Bulgarian). You could possibly cram them into an eight-bit encoding
without tipping ASCII out, but I'm not sure. In Unicode, these
languages are all easily supported by the BMP, as they don't use a
huge number of characters each.

3) Languages which use a non-alphabetic system (eg Korean). I think
they're all still covered by the BMP, but there's no way you can fit
them into eight-bit encodings - one single language will use more than
256 symbols.

4) Ancient, esoteric, or symbolic writing systems. Not fundamentally
different from the above categories except that they're less used, and
the BMP has finite space. These will definitely need the SMP.

But all of them are covered by Unicode. (Sadly, they are NOT all
covered by all fonts, so I've been finding that certain pieces of text
come out as strings of little boxes. But I can at least manipulate the
text, even if I can't read it back.) I can, for example, zip lines of
text like this:

English:
Let it go, let it go!
I am one with the wind and sky
Let it go, let it go!
You'll never see me cry!

Icelandic:
Þetta er nóg, þetta er nóg
Uppi í himni eins og vindablær
Þetta er nóg, komið nóg
Og tár mín enginn sér fær

Russian:
Отпусти и забудь,
Этот мир из твоих грёз.
Отпусти и забудь,
И не будет больше слёз.

Output:
Let it go, let it go!
Þetta er nóg, þetta er nóg
Отпусти и забудь,

I am one with the wind and sky
Uppi í himni eins og vindablær
Этот мир из твоих грёз.

Let it go, let it go!
Þetta er nóg, komið nóg
Отпусти и забудь,

You'll never see me cry!
Og tár mín enginn sér fær
И не будет больше слёз.

In fact, it's trivially easy to write something like this, because all
this text is Unicode. ALL of these languages (and plenty more) are
"well-covered by Unicode". There's still the ongoing debate of Han
unification, plus the progressive work of adding characters for
ancient scripts and such, but AFAIK, all writing systems currently in
use are covered.
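
One way the interleaving above could be written, as a Python 3 sketch (the variable names are made up for the example, and the lyrics are copied from the listing above):

```python
english = [
    "Let it go, let it go!",
    "I am one with the wind and sky",
    "Let it go, let it go!",
    "You'll never see me cry!",
]
icelandic = [
    "Þetta er nóg, þetta er nóg",
    "Uppi í himni eins og vindablær",
    "Þetta er nóg, komið nóg",
    "Og tár mín enginn sér fær",
]
russian = [
    "Отпусти и забудь,",
    "Этот мир из твоих грёз.",
    "Отпусти и забудь,",
    "И не будет больше слёз.",
]

# zip() pairs up the n-th line of each language; no per-language handling needed.
stanzas = ["\n".join(group) for group in zip(english, icelandic, russian)]
output = "\n\n".join(stanzas)
print(output)
```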

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Steven D'Aprano

Marko Rauhamaa wrote:

> Michael Torrie :
> 
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
> 
> I don't really like it how Unicode is equated with text, or even
> character strings.

That surely depends on the context. To be technically correct, Unicode is a
character set together with a set of rules for dealing with them (e.g.
rules for uppercasing characters, sorting rules, etc.). When referring to
the standard, "Unicode" is a noun; when referring to text, it is actually
an adjective being used as a noun. That is, "Unicode text" has become
abbreviated as just "Unicode" in much the same way as "human beings" has
become abbreviated as just "humans".

In that sense, "text is Unicode" just means "in the context in which we are
talking, when I say 'text' I mean 'Unicode text' as opposed to (for
example) 'ASCII text' or 'KOI-8 text'." It certainly doesn't mean that
*all* text in other contexts is Unicode, since that is obviously untrue.

(E.g. there are millions of existing files across the world containing text
which use legacy encodings that are not compatible with Unicode.)

> There's barely any difference between the truth value of these
> statements:
> 
>Python strings are ASCII.
> 
>Python strings are Latin-1.
> 
>Python strings are Unicode.
> 
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

When we say "Python strings are FOO", we are making a statement about
arbitrary Python strings, not a particular set of concrete examples of
strings. If Python strings are FOO, that means that for all possible Python
strings s, "s is FOO" is a true statement.

We cannot say that Python strings are uppercase, because we can easily find
counter-examples such as 'xyz'. Likewise we cannot say Python strings are
ASCII, or Latin-1, because we can easily find counter-examples such as 'Ř'.

On the other hand, Python strings *are* Unicode, because by design Python
strings are limited to Unicode. Every Python string is a Unicode string.
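
The counter-example can be checked mechanically. This is a Python 3 sketch (in Python 3, `str` is the Unicode string type); the helper `fits` is a hypothetical name introduced here.

```python
def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


sample = "\u0158"  # LATIN CAPITAL LETTER R WITH CARON, the counter-example above

results = {enc: fits(sample, enc) for enc in ("ascii", "latin-1", "utf-8")}
print(results)
```

The string is a perfectly good Python string, yet it falls outside both ASCII and Latin-1. UTF-8, being an encoding of all of Unicode, accepts it.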

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.

You are certainly correct that in its full generality, "text" is much more
than just a string of code points. A Unicode string is a primitive data
type. A powerful and sophisticated text processing application may even
find Python strings too primitive, possibly needing something like ropes of
graphemes rather than strings of code points.

We Western and Northern European speakers -- and I don't know whether Finns
are counted as Northern Europeans or Eastern Europeans -- are lucky in that
our natural languages are well-covered by Unicode. All our graphemes are
also code points, even the "funny ones with accents". As an English
speaker, I have to remind myself that not every grapheme is a single code
point, but Devanagari or Navajo writers will never make that mistake.
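
A short Python 3 sketch of the grapheme versus code point distinction. The Devanagari sample is an assumption chosen for illustration. Note that the standard library counts code points only and has no grapheme segmentation.

```python
import unicodedata

# Devanagari "ni": consonant NA (U+0928) followed by vowel sign I (U+093F).
# A reader would usually treat this as one unit, but it is two code points.
ni = "\u0928\u093f"
ni_code_points = len(ni)  # len() counts code points, not graphemes

# There is no precomposed form to fall back on: NFC leaves it unchanged.
ni_is_already_nfc = unicodedata.normalize("NFC", ni) == ni

# Latin "e with acute" has both a one-code-point and a two-code-point spelling.
e_acute_nfc = unicodedata.normalize("NFC", "e\u0301")
e_acute_nfd = unicodedata.normalize("NFD", "\u00e9")

print(ni_code_points, ni_is_already_nfc, len(e_acute_nfc), len(e_acute_nfd))
```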

> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We certainly shouldn't call numbers IEEE, but we might very well call them
IEEE-754. Actually, since IEEE-754 covers multiple formats, we have to be
more specific:

Python floats are IEEE-754 double-precision binary floats.
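
For what it is worth, this can be observed from Python itself. The following Python 3 sketch assumes a typical CPython build on mainstream hardware and uses the standard `sys` and `struct` modules.

```python
import struct
import sys

# A binary64 float has a 53-bit significand (52 stored bits plus an implicit one).
mantissa_bits = sys.float_info.mant_dig

# Big-endian IEEE-754 double for 1.0: sign 0, biased exponent 0x3FF, fraction 0.
packed = struct.pack(">d", 1.0)
unpacked = struct.unpack(">d", packed)[0]

print(mantissa_bits, packed.hex(), unpacked)
```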

-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa  wrote:
> Michael Torrie :
>
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
>
> I don't really like it how Unicode is equated with text, or even
> character strings.
>
> There's barely any difference between the truth value of these
> statements:
>
>Python strings are ASCII.
>
>Python strings are Latin-1.
>
>Python strings are Unicode.
>
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

The difference is that ASCII and Latin-1 cut out a large number of
active world languages, UCS-2 (the intermediate option you didn't
mention) cuts out a small proportion (by usage) of significant
characters, and Unicode cuts out only those characters which fall
under issues like Han unification. (Plus any that haven't yet been
allocated. But since Python doesn't actually validate code points to
ensure that they've been given meanings, you can use today's Python to
work with tomorrow's Unicode.)
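
A Python 3 sketch of that behaviour. The choice of U+0378 as an example of a code point with no assigned character is an assumption (it was unassigned at the time of writing), so the name lookup uses a default instead of relying on it.

```python
import sys
import unicodedata

# The code point range is fixed: 0 through 0x10FFFF, i.e. 1,114,112 values.
total_code_points = sys.maxunicode + 1

# Anything past the end of the range is rejected outright.
try:
    chr(sys.maxunicode + 1)
    beyond_range_accepted = True
except ValueError:
    beyond_range_accepted = False

# A code point inside the range is accepted whether or not it has a meaning yet.
maybe_unassigned = chr(0x0378)
name = unicodedata.name(maybe_unassigned, None)  # None if this Python knows no name
survives_utf8 = maybe_unassigned.encode("utf-8").decode("utf-8") == maybe_unassigned

print(total_code_points, beyond_range_accepted, name, survives_utf8)
```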

Do you have actual text that you're unable to represent in Unicode? If
so, you are going to have major problems using it with *any* computer
system. There are Japanese encodings that can represent additional
characters, but they also *cannot* represent a lot of the other
characters we use, so there'll be fundamental incompatibilities.

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.
>
> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We don't call numbers IEEE, but if we're working with Python floats,
we *do* require all numbers to be representable as IEEE
floating-point. Don't like that? Pick decimal.Decimal instead, or
fractions.Fraction, and pick a different set of limitations... but
ultimately, you *will* have restrictions - and much tighter
restrictions than Unicode places on text.
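
A quick Python 3 sketch of "pick a different set of limitations", using the standard `decimal` and `fractions` modules with the default decimal context:

```python
from decimal import Decimal
from fractions import Fraction

# Binary floats cannot represent 0.1, 0.2 or 0.3 exactly.
float_sum_exact = (0.1 + 0.2 == 0.3)

# Decimal represents these decimal fractions exactly...
decimal_sum_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# ...but has finite precision, so one third is still only approximated.
third = Decimal(1) / Decimal(3)
decimal_thirds_exact = (third * 3 == Decimal(1))

# Fraction is exact for all rationals, at the cost of unbounded growth
# in numerator and denominator (and it cannot hold irrationals at all).
fraction_thirds_exact = (Fraction(1, 3) * 3 == 1)

print(float_sum_exact, decimal_sum_exact, decimal_thirds_exact, fraction_thirds_exact)
```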

Do you genuinely have text that you can't represent in Unicode, or are
you just arguing against Unicode to try to justify "Python strings are
" as a basis for your code?

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 4:42 AM,   wrote:
> On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
>>
>> Why should it encode to bytes?
>
> Because a bytes format string suggests a bytes result. Why does unicode
> always "win", rather than the type of the format string always winning?

For the same reason that float always "wins":

>>> 1.0 + 2
3.0
>>> 1 + 2.0
3.0

>> Makes much better sense to work in
>> Unicode. But mainly, it has to do one of them, and be predictable.
>
> Yeah, but string % is not a symmetrical operator. People's mental model
> of it is likely to be that it acts like format (which does use the type
> of the format string) or C sprintf/wsprintf (both of which use the same
> type for the format string and result). And literally every other type
> is converted to the type of the format string when used with %s - having
> unicode be special adds cognitive load, and it means you can't safely
> blindly use %s with an unknown object.

True, but Python 2 deliberately lets you conflate the two, so you get
a bit of convenience at the expense of complexity when things go
wrong. Python 3, on the other hand, is much more careful about the
difference:

>>> "asdf %s qwer" % b"zxcv"
"asdf b'zxcv' qwer"
>>> b"asdf %s qwer" % "zxcv"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'str'

So your complaint *has* been resolved... but only in Python 3, because
the change would break stuff.
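
In Python 3 the fix is to say explicitly which way the conversion should go. A minimal sketch, assuming Python 3.5 or later (where `%` formatting on bytes is available again) and ASCII as the chosen encoding:

```python
template = b"asdf %s qwer"
value = "zxcv"

# Option 1: stay in bytes by encoding the text argument first.
as_bytes = template % value.encode("ascii")

# Option 2: move to text by decoding the template first.
as_text = template.decode("ascii") % value

print(as_bytes, as_text)
```

Either way the encoding is named in the code instead of being picked implicitly.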

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832



On Thu, Nov 20, 2014, at 16:29, Ethan Furman wrote:
> If your unicode string happens to contain a base64 encoded .png, then you
> could decode that into bytes.  ;)

Bytes of the PNG, or of the raw pixels?

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Ethan Furman :

> If your unicode string happens to contain a base64 encoded .png, then
> you could decode that into bytes. ;)

You could embed your PNG file in XML in binary form as CDATA. Then, your
"characters" would represent 8- or 16-bit integers. You just need to
replace all accidental occurrences of 

   ]]>

with

   ]]>]]>

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ethan Furman

On 11/20/2014 07:53 AM, Chris Angelico wrote:
> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>> I think that you may get a Unicode/Encode/Error when you try to /decode/ a
>> unicode string is more confusing...
> 
> Hang on a minute, what does it even mean to decode a Unicode string?
> That's where the problem is. Fortunately that's one that Py3 solved -
> str simply doesn't have a decode() method.

If your unicode string happens to contain a base64 encoded .png, then you could 
decode that into bytes.  ;)

--
~Ethan~




Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Mark Lawrence


On 20/11/2014 18:06, Ian Kelly wrote:
> On Thu, Nov 20, 2014 at 10:42 AM,   wrote:
>> and it means you can't safely
>> blindly use %s with an unknown object.
>
> You can't safely do this anyway. Whether it's %s with a str and a
> unicode, or %s with a unicode and a str, *something* is going to have
> to be implicitly encoded or decoded, and if ascii doesn't happen to be
> the correct encoding then the result will be either an error or a
> silent failure.



All I know about this encoding/decoding malarky is that I'd prefer an 
error to a silent failure any day of the week.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Michael Torrie :

> Unicode can only be encoded to bytes.
> Bytes can only be decoded to unicode.

I don't really like it how Unicode is equated with text, or even
character strings.

There's barely any difference between the truth value of these
statements:

   Python strings are ASCII.

   Python strings are Latin-1.

   Python strings are Unicode.

Each of those statements is true as long as you stay within the
respective character sets, and cease to be true when your text contains
characters outside the character sets.

Now, it is true that Python currently limits itself to the 1,114,112
Unicode code points. And it likely won't adopt more characters unless
Unicode does it first. However, text is something more lofty and
abstract than a sequence of Unicode code points.

We shouldn't call strings Unicode any more than we call numbers IEEE or
times ISO.


Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

random...@fastmail.us wrote:

> On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
>> On Fri, Nov 21, 2014 at 12:59 AM,   wrote:
>> > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
>> >> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
>> >> Traceback (most recent call last):
>> >>   File "<stdin>", line 1, in <module>
>> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
>> >> 4: ordinal not in range(128)
>> >
>> > This is surprising to me - why is it trying to decode the format
>> > string, rather than encode the arguments?
>> 
>> Why should it encode to bytes?
> 
> Because a bytes format string suggests a bytes result. Why does unicode
> always "win", rather than the type of the format string always winning?

My guess is that when unicode was introduced the decision to propagate str 
to unicode in some cases was made because the developers expected that more 
old code that was unaware of unicode would continue to work. 

The old methods __mod__(), replace(), and join() that conceptually deal with 
strings propagate while those that deal with characters -- center(), 
r/ljust(), translate() -- don't.

The newer format() method doesn't propagate which is probably due to a 
change in attitude rather than an oversight.


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ian Kelly

On Thu, Nov 20, 2014 at 11:06 AM, Ian Kelly  wrote:
> On Thu, Nov 20, 2014 at 10:42 AM,   wrote:
>> and it means you can't safely
>> blindly use %s with an unknown object.
>
> You can't safely do this anyway. Whether it's %s with a str and a
> unicode, or %s with a unicode and a str, *something* is going to have
> to be implicitly encoded or decoded, and if ascii doesn't happen to be
> the correct encoding then the result will be either an error or a
> silent failure.

Also note that if you use %r instead of %s, you'll get the result you
want (although the unicode string will be quoted rather than encoded).

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ian Kelly

On Thu, Nov 20, 2014 at 10:42 AM,   wrote:
> and it means you can't safely
> blindly use %s with an unknown object.

You can't safely do this anyway. Whether it's %s with a str and a
unicode, or %s with a unicode and a str, *something* is going to have
to be implicitly encoded or decoded, and if ascii doesn't happen to be
the correct encoding then the result will be either an error or a
silent failure.

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
> On Fri, Nov 21, 2014 at 12:59 AM,   wrote:
> > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
> >> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
> >> Traceback (most recent call last):
> >>   File "<stdin>", line 1, in <module>
> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
> >> ordinal not in range(128)
> >
> > This is surprising to me - why is it trying to decode the format string,
> > rather than encode the arguments?
> 
> Why should it encode to bytes?

Because a bytes format string suggests a bytes result. Why does unicode
always "win", rather than the type of the format string always winning?

> Makes much better sense to work in
> Unicode. But mainly, it has to do one of them, and be predictable.

Yeah, but string % is not a symmetrical operator. People's mental model
of it is likely to be that it acts like format (which does use the type
of the format string) or C sprintf/wsprintf (both of which use the same
type for the format string and result). And literally every other type
is converted to the type of the format string when used with %s - having
unicode be special adds cognitive load, and it means you can't safely
blindly use %s with an unknown object.

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten <__pete...@web.de> wrote:
>> Chris Angelico wrote:
>>
>>> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>>>> I think that you may get a Unicode/Encode/Error when you try to
>>>> /decode/ a unicode string is more confusing...
>>>
>>> Hang on a minute, what does it even mean to decode a Unicode string?
>>
>> Let's not get philosophical ;)
> 
> No, I'm quite serious. 

I'm sorry I'm limited to text, otherwise I would have formatted the

";)" as 30pt blinking magenta...

> You encode Unicode text into bytes; you decode
> bytes into text. You can also encode a floating-point value into
> bytes, and decode bytes into a float. Or you could encode a large and
> complex structure into bytes, using something like pickle or json, and
> then decode those bytes later on. The pattern is always the same: the
> abstract object with meaning to a human is encoded into a concrete
> form that a computer can handle, and the concrete is decoded into the
> abstract. If you're not good at sight-reading sheet music, you'll have
> the same feeling of staring at the dots, decoding them one by one into
> this abstract thing called "music", and then being able to work with
> it.
> 
> When you try to decode a Unicode string, what happens is that Python 2
> says "Oh, you're trying to do a byte-string operation on a Unicode
> string... I'll quickly encode that to bytes for you, then do what you
> asked". That's why you can get an *en*coding error when you asked to
> *de*code - because both operations have to happen.

In an alternative universe unicode.decode() could have been implemented as a 
no-op. 

As you put it it looks like you have to find the true nature of the problem 
and then cast it into code -- a kind of essentialism. I would rather 
emphasise the process; the evolving interface changes your view on the 
underlying problem -- a hermeneutic cycle if you will.


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Michael Torrie

On 11/20/2014 09:32 AM, Peter Otten wrote:
> Chris Angelico wrote:
> 
>> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>>> I think that you may get a Unicode/Encode/Error when you try to /decode/
>>> a unicode string is more confusing...
>>
>> Hang on a minute, what does it even mean to decode a Unicode string?
> 
> Let's not get philosophical ;)

It's not philosophical.  It's an important distinction that folks need
to be clear on when understanding unicode and the errors that python can
throw.

Unicode can only be encoded to bytes.
Bytes can only be decoded to unicode.

Without understanding that, the exception errors about decoding won't be
properly understood, nor will one know how to fix them.
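
Python 3 builds this rule into the types themselves. A small sketch, assuming Python 3 and UTF-8 as the encoding (the sample word is borrowed from the format string quoted earlier in the thread):

```python
text = "nötig"

# Text has encode() but no decode(); bytes has decode() but no encode().
str_directions = (hasattr(str, "encode"), hasattr(str, "decode"))
bytes_directions = (hasattr(bytes, "encode"), hasattr(bytes, "decode"))

# Encode: text -> bytes. Decode: bytes -> text.
data = text.encode("utf-8")
restored = data.decode("utf-8")

print(str_directions, bytes_directions, data, restored == text)
```

The byte `0xc3` in the encoded form is the same byte the `'ascii'` codec complained about in the traceback that started this subthread.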

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten <__pete...@web.de> wrote:
> Chris Angelico wrote:
>
>> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>>> I think that you may get a Unicode/Encode/Error when you try to /decode/
>>> a unicode string is more confusing...
>>
>> Hang on a minute, what does it even mean to decode a Unicode string?
>
> Let's not get philosophical ;)

No, I'm quite serious. You encode Unicode text into bytes; you decode
bytes into text. You can also encode a floating-point value into
bytes, and decode bytes into a float. Or you could encode a large and
complex structure into bytes, using something like pickle or json, and
then decode those bytes later on. The pattern is always the same: the
abstract object with meaning to a human is encoded into a concrete
form that a computer can handle, and the concrete is decoded into the
abstract. If you're not good at sight-reading sheet music, you'll have
the same feeling of staring at the dots, decoding them one by one into
this abstract thing called "music", and then being able to work with
it.

When you try to decode a Unicode string, what happens is that Python 2
says "Oh, you're trying to do a byte-string operation on a Unicode
string... I'll quickly encode that to bytes for you, then do what you
asked". That's why you can get an *en*coding error when you asked to
*de*code - because both operations have to happen.
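
Both halves of this can be sketched in Python 3. The first part shows the same encode/decode pattern for a float and for a structure. The second is a hypothetical emulation (the function `py2_style_decode` is invented here, and Python 3 itself has no such method) of what Python 2 does behind the scenes, assuming its default `ascii` codec.

```python
import json
import struct

# Float <-> bytes: same pattern as text <-> bytes.
float_bytes = struct.pack("<d", 2.5)
float_back = struct.unpack("<d", float_bytes)[0]

# Structure <-> bytes, via JSON text and then UTF-8.
record = {"name": "üblich", "count": 2}
record_bytes = json.dumps(record).encode("utf-8")
record_back = json.loads(record_bytes.decode("utf-8"))


def py2_style_decode(text, encoding):
    """Emulate Python 2's unicode.decode(): implicit encode, then decode."""
    raw = text.encode("ascii")   # the hidden step; this is what can fail
    return raw.decode(encoding)  # the step that was actually asked for


try:
    py2_style_decode("üblich", "utf-8")
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

print(float_back, record_back == record, error_name)
```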

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>> I think that you may get a Unicode/Encode/Error when you try to /decode/
>> a unicode string is more confusing...
> 
> Hang on a minute, what does it even mean to decode a Unicode string?

Let's not get philosophical ;)

> That's where the problem is. Fortunately that's one that Py3 solved -
> str simply doesn't have a decode() method.




Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
> I think that you may get a Unicode/Encode/Error when you try to /decode/ a
> unicode string is more confusing...

Hang on a minute, what does it even mean to decode a Unicode string?
That's where the problem is. Fortunately that's one that Py3 solved -
str simply doesn't have a decode() method.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

random...@fastmail.us wrote:

> On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
>> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>> ordinal not in range(128)
> 
> This is surprising to me - why is it trying to decode the format string,
> rather than encode the arguments?

Probably to make it easier to mix byte and unicode strings. In hindsight it 
may not have been a good idea, but it had the potential to save some memory.

I think that you may get a Unicode/Encode/Error when you try to /decode/ a 
unicode string is more confusing...


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau  wrote:
> Hi,
>
> Thanks for the "from __future__ import unicode_literals" trick, it makes
> that switch much less intrusive.
>
> However it seems that I will suddenly be trapped by all modules which
> are not prepared to handle unicode. For example:
>
>  >>> from __future__ import unicode_literals
>  >>> import locale
>  >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
>      locale = normalize(_build_localename(locale))
>    File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
>      language, encoding = localetuple
>  ValueError: too many values to unpack
>
> Is the locale module an exception and in that case I'll fix it by doing:
>
>  >>> locale.setlocale(locale.LC_ALL, b'fr_FR')
>
> or is a (big) part of the modules in python 2.7 still not ready for
> unicode and in that case I have to decide which prefix (u or b) I should
> manually add ?

Sadly, there are quite a lot of parts of Python 2 that simply don't
handle Unicode strings. But you can probably keep all of those down to
just a handful of explicit b"whatever" strings; most places should
accept unicode as well as str. What you're seeing here is a prime
example of one of this author's points (caution, long post):

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

"""The lesson of Python 3 is: give programmers a Unicode string type,
*make it the default*, and encoding issues will /mostly/ go away."""

There's a whole ecosystem to Python 2 - some in the standard library,
heaps more in the rest of the world - and a lot of it was written on
the assumption that a byte is a character is an octet. When you pass
Unicode strings to functions written to expect byte strings, sometimes
you win, and sometimes you lose... even with the standard library
itself. But the Python 3 ecosystem has been written on the assumption
that strings are Unicode. It's only a narrow set of programs
("boundary code", where you're moving text across networks and stuff
like that) where the Python 2 model is easier to work with; and the
recent Py3 releases have been progressively working to relieve that
pain.

The absolute worst case is a function which exists in Python 2 and 3,
and requires a byte string in Py2 and a text string in Py3. Sadly,
that may be exactly what locale.setlocale() is. For that, I would
suggest explicitly passing stuff through str():

locale.setlocale(locale.LC_ALL, str('fr_FR'))

In Python 3, 'fr_FR' is already a str, so passing it through str()
will have no significant effect. (Though it would be worth commenting
that, to make it clear to a subsequent reader that this is Py2 compat
code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode,
so passing it through str() will encode it to ASCII, producing a byte
string that setlocale should be happy with.
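
A minimal sketch of that idiom, assuming one source file has to run on both Python 2 and 3. The helper name set_locale_compat is hypothetical, and the 'C' locale is used only because it should exist on every system; substitute 'fr_FR' where that locale is installed.

```python
import locale


def set_locale_compat(name):
    # Py2 compat: setlocale() wants the native str type on both versions.
    # On Python 3, str() leaves a text string unchanged; on Python 2 it
    # encodes a unicode object to an ASCII byte string.
    return locale.setlocale(locale.LC_ALL, str(name))


# u'C' stands in for a literal written under unicode_literals.
result = set_locale_compat(u'C')
print(result)
```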

By the way, the reason for the strange error message is clearer in
Python 3, which chains in another exception:

>>> locale.setlocale(locale.LC_ALL, b'fr_FR')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
    language, encoding = localetuple
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
    locale = normalize(_build_localename(locale))
  File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
    raise TypeError('Locale must be None, a string, or an iterable of two strings -- language code, encoding.')
TypeError: Locale must be None, a string, or an iterable of two strings -- language code, encoding.

So when it gets the wrong type of string, it attempts to unpack it as
an iterable; it yields five values (the five bytes or characters,
depending on which way it's the wrong type of string), but it's
expecting two. Fortunately, str() will deal with this. But make sure
you don't have the b prefix, or str() in Py3 will give you quite a
different result!
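
Both points can be reproduced without touching the locale module. The sketch below assumes Python 3; unpack_locale is a hypothetical stand-in for the two-name unpacking shown in the traceback, not the library's actual code.

```python
def unpack_locale(localetuple):
    # Same shape as the failing line: exactly two items are expected.
    language, encoding = localetuple
    return language, encoding


try:
    unpack_locale(b'fr_FR')  # iterating the string yields five items
except ValueError as exc:
    error = str(exc)

# On Python 3, str() of a bytes object returns its repr, prefix and
# quotes included, rather than decoding it.
wrapped = str(b'fr_FR')

print(error)
print(wrapped)
```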

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 12:59 AM, random832 wrote:
> On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
>> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>> ordinal not in range(128)
>
> This is surprising to me - why is it trying to decode the format string,
> rather than encode the arguments?

Why should it encode to bytes? Makes much better sense to work in
Unicode. But mainly, it has to do one of them, and be predictable. If
you add a float and an int, you have to predictably get back one of
those two types, and since neither is a perfect superset of the other,
one just has to be picked. (And that's float, because it's more likely
to be the better option.) In this case, picking Unicode to meet on is
easily the better option, because you'll often have pure-ASCII string
literals as format strings, and Unicode data being interpolated into
it. So using an ASCII codec is far more likely to succeed if you
decode the format string than if you encode the data.
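
A rough illustration of both halves of that argument, written for Python 3, where the ASCII decode that Python 2 performs implicitly has to be spelled out. The variable names are only for this sketch.

```python
# Mixed numeric types meet on float, predictably.
mixed = 1 + 2.5

# A non-ASCII byte string, like the "%s nötig %s" format quoted above.
fmt = u'n\xf6tig: %s'.encode('utf-8')
try:
    fmt.decode('ascii')  # the step Python 2 attempts behind the scenes
except UnicodeDecodeError as exc:
    reason = str(exc)

# An ASCII-only format string decodes cleanly, so Unicode data can be
# interpolated into it without loss.
ascii_fmt = b'%s: %s'.decode('ascii')
combined = ascii_fmt % (u'\xfcblich', u'\xe4hnlich')

print(type(mixed).__name__)
print(reason)
print(combined)
```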

Personally, I'd much rather be very clear about what's text and what's
bytes, and not have any automatic encoding at all. That's why I use
Python 3.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Francis Moreau

Hi,

On 11/20/2014 11:47 AM, Chris Angelico wrote:
> On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau wrote:
>> My question is: how should this be fixed properly ?
>>
>> A simple solution would be to force all strings passed to the
>> logger to be unicode:
>>
>>   log.debug(u"%s: %s" % ...)
>>
>> and more generally force all strings in my code to be unicode by
>> using the 'u' prefix.
> 
> Yep. And then you may want to consider "from __future__ import
> unicode_literals", which will make string literals represent Unicode
> strings rather than byte strings. Basically the same as you're saying,
> only without the explicit u prefixes.

Thanks for the "from __future__ import unicode_literals" trick, it makes
that switch much less intrusive.

However, it seems that I will now be tripped up by every module that
is not prepared to handle unicode. For example:

 >>> from __future__ import unicode_literals
 >>> import locale
 >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
     locale = normalize(_build_localename(locale))
   File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
     language, encoding = localetuple
 ValueError: too many values to unpack

Is the locale module an exception, in which case I'll fix it by doing:

 >>> locale.setlocale(locale.LC_ALL, b'fr_FR')

or are a (big) part of the modules in Python 2.7 still not ready for
unicode, in which case I have to decide which prefix (u or b) to add
manually?

Thanks.

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: 
> ordinal not in range(128)

This is surprising to me - why is it trying to decode the format string,
rather than encode the arguments?

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Thu, Nov 20, 2014 at 11:35 PM, Peter Otten <__pete...@web.de> wrote:
> You don't need to change an all-ascii bytestring to unicode.
> Lo and behold:
>
> >>> "%s %s" % (u"üblich", u"ähnlich")
> u'\xfcblich \xe4hnlich'
> >>> u"%s %s" % (u"üblich", u"ähnlich")
> u'\xfcblich \xe4hnlich'
>
> Only non-ascii bytestrings mean trouble, either noisy
>

It's better to not depend on that, though. Be clear and explicit about
the difference between bytes and text, and don't try to pretend
they're the same thing, even for ASCII.
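
One way to be explicit is to decode at the boundary, before any formatting happens. This is only a sketch: to_text is a hypothetical helper, and UTF-8 is an assumed encoding that would have to match whatever actually produced the bytes.

```python
def to_text(value, encoding='utf-8'):
    # Decode byte strings explicitly; pass text through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


header = b'\xc3\xbcblich'  # bytes, e.g. read from a file or socket
message = u'\xe4hnlich'    # text, e.g. returned by a gettext call

# Every operand is text by the time the format operator runs, so no
# implicit ASCII decode is ever attempted.
line = u'%s: %s' % (to_text(header), to_text(message))
print(line)
```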

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Francis Moreau wrote:

> Hello,
> 
> My application uses the gettext module for translation.
> Translated messages are unicode on both Python 2 and 3
> (with Python 2.7 I had to explicitly ask for unicode).
> 
> A problem arises when formatting those messages before logging
> them. For example:
> 
>   log.debug("%s: %s" % (header, _("will return an unicode string")))

This is only problematic if header is a non-ascii bytestring.

> Indeed on python2.7, "%s: %s" is 'str' whereas _() returns
> unicode.
> 
> My question is: how should this be fixed properly ?
> 
> A simple solution would be to force all strings passed to the
> logger to be unicode:
> 
>   log.debug(u"%s: %s" % ...)
> 
> and more generally force all strings in my code to be unicode by
> using the 'u' prefix.
> 
> or is there a proper solution ?

You don't need to change an all-ascii bytestring to unicode. 
Lo and behold:

>>> "%s %s" % (u"üblich", u"ähnlich")
u'\xfcblich \xe4hnlich'
>>> u"%s %s" % (u"üblich", u"ähnlich")
u'\xfcblich \xe4hnlich'

Only non-ascii bytestrings mean trouble, either noisy

>>> u"%s nötig %s" % (u"üblich", "ähnlich")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> "%s nötig %s" % (u"üblich", u"ähnlich")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

or silently until you have to decipher the logfile contents. It's best to 
stay away from them, and the

from __future__ import unicode_literals

that Chris mentioned is a convenient way to achieve that.


Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau  wrote:
> My question is: how should this be fixed properly ?
>
> A simple solution would be to force all strings passed to the
> logger to be unicode:
>
>   log.debug(u"%s: %s" % ...)
>
> and more generally force all strings in my code to be unicode by
> using the 'u' prefix.

Yep. And then you may want to consider "from __future__ import
unicode_literals", which will make string literals represent Unicode
strings rather than byte strings. Basically the same as you're saying,
only without the explicit u prefixes.

This will also make your Py2 code behave more like the way your Py3
code does (as bare string literals are always Unicode strings in Py3).

ChrisA

python 2.7 and unicode (one more time)

2014-11-20 Thread Francis Moreau

Hello,

My application uses the gettext module for translation.
Translated messages are unicode on both Python 2 and 3
(with Python 2.7 I had to explicitly ask for unicode).

A problem arises when formatting those messages before logging
them. For example:

  log.debug("%s: %s" % (header, _("will return an unicode string")))

Indeed on python2.7, "%s: %s" is 'str' whereas _() returns
unicode.

My question is: how should this be fixed properly?

A simple solution would be to force all strings passed to the
logger to be unicode:

  log.debug(u"%s: %s" % ...)

and more generally force all strings in my code to be unicode by
using the 'u' prefix.

Or is there a more proper solution?

Thanks.
