Re: python 2.7 and unicode (one more time)
Hi Peter Otten

re: There is no assignment

soup_atag = whatever

but there is one to atag. The whole session should work when you omit the offending line

> atag = soup_atag.a

or insert

soup_atag = soup

before it.

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> html_atag = """Test html a tag example
... http://www.packtpub.com'>Home
...
>>> soup = BeautifulSoup(html_atag,'lxml')
>>> atag = soup.aprint(atag)
>>> atag = soup.a
>>> print(atag)
http://www.packtpub.com'>Home
>>> type(atag)
>>> tagname = atag.name
>>> print tagname
a
>>> atag.name = 'p'
>>> print (soup)
Test html a tag example http://www.packtpub.com'>Home
>>> atag.name = 'p'
>>> print(soup)
Test html a tag example http://www.packtpub.com'>Home
>>> atag.name = 'a'
>>> print(soup)
Test html a tag example http://www.packtpub.com'>Home
>>> soup_atag = soup
>>> atag = soup_atag.a
>>> print (atag['href'])
http://www.packtpub.com'>Home
>>>

Thank you.

Yours Simon.
--
https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Tue, Nov 25, 2014 at 10:56 PM, Steven D'Aprano wrote: > I think this conversation is going nowhere, so it's probably best to end it. \0 ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote: > Steven D'Aprano : > >> Marko Rauhamaa wrote: >> Py3's byte strings are still strings, though. >>> >>> Hm. I don't think so. In a plain English sense, maybe, but that kind of >>> usage can lead to confusion. >> >> Only if you are determined to confuse yourself. >> >> {...] >> >> In Python usage, "string" always refers to the `str` type, unless >> prefixed with "byte", in which case it refers to the immutable >> byte-string type (`str` in Python 2, `bytes` in Python 3.) > > You are saying what I'm saying. > > Byte strings are *not* strings. Of course they are. They are strings of bytes, just as the name suggests. > Prairie dogs are not dogs. No need to call dogs "domesticated dogs" to > tell them apart from "prairie dogs". But wild dogs *are* dogs, and there is a need to distinguish between wild dogs and domesticated dogs. Just as there is a need to distinguish between byte strings, ASCII strings, Latin-1 strings, Big5 strings, Unicode strings, Tron strings and cheese strings. I think this conversation is going nowhere, so it's probably best to end it. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Steven D'Aprano : > Marko Rauhamaa wrote: > >>> Py3's byte strings are still strings, though. >> >> Hm. I don't think so. In a plain English sense, maybe, but that kind of >> usage can lead to confusion. > > Only if you are determined to confuse yourself. > > {...] > > In Python usage, "string" always refers to the `str` type, unless > prefixed with "byte", in which case it refers to the immutable > byte-string type (`str` in Python 2, `bytes` in Python 3.) You are saying what I'm saying. Byte strings are *not* strings. Prairie dogs are not dogs. No need to call dogs "domesticated dogs" to tell them apart from "prairie dogs". Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Tue, Nov 25, 2014 at 9:56 AM, Steven D'Aprano wrote: > In all cases apart from an explicit "byte string", the word "string" is > always used for the native array-of-characters type delimited by plain > quotation marks, as used for error messages, user prompts, etc., regardless > whether the implementation is an array of 8-bit bytes (as used by Python > 2), or the full Unicode character set (as used by Python 3). So in > practice, provided you know which version of Python is being discussed, > there is never any genuine ambiguity when using the word "string" and no > excuse for confusion. And frequently, even if you're talking about Py2/Py3 cross code, there's still no ambiguity about the word "string": it means a default-for-the-language string. The locale.setlocale() function expects a string as its second parameter, for instance. (And unfortunately, flatly refuses the other sort, whichever way around that is.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote:

>> Py3's byte strings are still strings, though.
>
> Hm. I don't think so. In a plain English sense, maybe, but that kind of
> usage can lead to confusion.

Only if you are determined to confuse yourself.

People are quite capable of interpreting correctly sentences like:

"My friend Susan and I were talking about Jenny, and she said that she had had a horrible fight with her boyfriend and was breaking up with him."

and despite the ambiguity correctly interpret who "she" and "her" refers to each time. Compared to that, correctly understanding the mild complexity of "string" is trivial.

In Python usage, "string" always refers to the `str` type, unless prefixed with "byte", in which case it refers to the immutable byte-string type (`str` in Python 2, `bytes` in Python 3.)

"Unicode string" always refers to the immutable Unicode string type (`unicode` in Python 2, `str` in Python 3).

"Text string" is more ambiguous. Some people consider the prefix to be redundant, e.g. "text string" always refers to `str`, while others consider it to be in opposition to "byte string", i.e. to be a synonym for "Unicode string".

In all cases apart from an explicit "byte string", the word "string" is always used for the native array-of-characters type delimited by plain quotation marks, as used for error messages, user prompts, etc., regardless whether the implementation is an array of 8-bit bytes (as used by Python 2), or the full Unicode character set (as used by Python 3). So in practice, provided you know which version of Python is being discussed, there is never any genuine ambiguity when using the word "string" and no excuse for confusion.

--
Steven
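The naming scheme described above can be checked in a running interpreter. This is a minimal sketch for Python 3 only; the variable names and the sample word are illustrative assumptions, not taken from the thread.

```python
# Python 3 names; in Python 2 the same two roles were played by
# `unicode` (Unicode string) and `str` (byte string).
text = "caf\u00e9"            # Unicode string (str): 4 code points
raw = text.encode("utf-8")    # byte string (bytes): 5 bytes, U+00E9 takes two

# Decoding with the same encoding recovers the original text.
round_trip = raw.decode("utf-8")

print(type(text).__name__, len(text))
print(type(raw).__name__, len(raw))
```

The two lengths differ because one counts code points and the other counts bytes, which is the practical reason the two kinds of string need different names.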
Re: python 2.7 and unicode (one more time)
Chris Angelico:

> Py3's byte strings are still strings, though.

Hm. I don't think so. In a plain English sense, maybe, but that kind of usage can lead to confusion.

For example,

   A subscription selects an item of a sequence (string, tuple or list)
   or mapping (dictionary) object:

   subscription ::= primary "[" expression_list "]"

   [...]

   A string’s items are characters. A character is not a separate data
   type but a string of exactly one character.

   <https://docs.python.org/3/reference/expressions.html#subscriptions>

The text is probably a bit buggy since it skates over bytes and byte arrays listed as sequences (by <https://docs.python.org/3/reference/datamodel.html>). However, your Python3 implementation would fail if it interpreted bytes objects to be strings in the above paragraph:

   >>> "abc"[1]
   'b'
   >>> b'abc'[1]
   98

The subscription of a *string* evaluates to a *string*. The subscription of a *bytes* object evaluates to a *number*.

Marko
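The indexing difference described in this message can be reproduced as a small script. It is a sketch for Python 3 with illustrative variable names; the slicing comparison is an addition not mentioned in the thread.

```python
# Python 3: indexing a str yields a str, indexing bytes yields an int.
text = "abc"
data = b"abc"

text_item = text[1]    # a one-character str
data_item = data[1]    # an int in range(256)

# Slicing, unlike indexing, preserves the container type in both cases.
text_slice = text[1:2]
data_slice = data[1:2]

print(repr(text_item), repr(data_item))
print(repr(text_slice), repr(data_slice))
```

So a one-element slice of a bytes object is still a bytes object, even though a single subscript gives a number.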
Re: python 2.7 and unicode (one more time)
On Mon, Nov 24, 2014 at 5:57 PM, Marko Rauhamaa wrote:
> Yes, people call strings "Unicode strings" because Python2 *did have*
> unicode strings separate from regular strings:
>
>    Python2            Python3
>    --------------------------------------
>    string             bytes (byte string)
>    unicode string     string
>
> In Python2 days, Unicode was a fancy, exotic datatype for the
> connoisseurs. The rest used strings. Python3 supposedly elevates Unicode
> to boring normalcy. Now it's bytes that have fallen into (unmerited)
> disfavor.

Py3's byte strings are still strings, though. People don't use bytearray for everything.

ChrisA
Re: python 2.7 and unicode (one more time)
Gregory Ewing:

> Marko Rauhamaa wrote:
>> Unicode strings is not wrong but the technical emphasis on Unicode is as
>> strange as a "tire car" or "rectangular door" when "car" and "door" are
>> what you usually mean.
>
> The reason Unicode gets emphasised so much is that until relatively
> recently, it *wasn't* what "string" usually meant in Python.
>
> When Python 3 has been around for as long as Python 2 was, things may
> change.

Yes, people call strings "Unicode strings" because Python2 *did have* unicode strings separate from regular strings:

   Python2            Python3
   --------------------------------------
   string             bytes (byte string)
   unicode string     string

In Python2 days, Unicode was a fancy, exotic datatype for the connoisseurs. The rest used strings. Python3 supposedly elevates Unicode to boring normalcy. Now it's bytes that have fallen into (unmerited) disfavor.

But old habits die hard; you call cars "automobile cars" instead of "cars" since, after all, "cars" were always pulled by horses...

Marko

PS Maybe interestingly, Guile went through an analogous transition. As of Guile 2.0,

   a character is anything in the Unicode Character Database. [...]
   Strings are fixed-length sequences of characters. [...] A bytevector
   is a raw bit string.

   <https://www.gnu.org/software/guile/manual/html_node/index.html>

However, Guile 1.8 still had:

   The Guile implementation of character sets currently deals only with
   8-bit characters.

   <https://www.gnu.org/software/guile/docs/docs-1.8/guile-ref/index.html>

and there were no bytevectors.
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014, at 15:31, Dave Angel wrote: > I didn't realize Windows shell (DOS box) had that bug. Course I don't > use Windows much the last few years. > > it's one thing to not display it properly. It's quite another to supply > faulty data to the clipboard. Especially since the Windows clipboard > has a separate Unicode type available. It's because console bitmap fonts almost always (always?) only have one codepage's worth of characters, and it's considered better to display A for U+0100 than a blank space, and the clipboard has always been a bit of an afterthought for the windows console. Meanwhile, a truetype font is considered likely to have real glyphs for most characters a user would want to display, so no conversion is done. And there's no font rendering routine for bitmap fonts that will allow for dynamic substitution of glyphs, so it becomes a real A (or whatever) in the console buffer itself - this isn't a conversion done at clipboard-copy time. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Mon, Nov 24, 2014 at 9:51 AM, Gregory Ewing wrote: > Marko Rauhamaa wrote: >> >> Unicode strings is not wrong but the technical emphasis on Unicode is as >> strange as a "tire car" or "rectangular door" when "car" and "door" are >> what you usually mean. > > > The reason Unicode gets emphasised so much is that > until relatively recently, it *wasn't* what "string" > usually meant in Python. > > When Python 3 has been around for as long as Python > 2 was, things may change. I doubt it; the bytes() type is sufficiently stringy to require the distinction to still be made. PEP 461 makes it clear that byte strings are not blobs of opaque data, but are very definitely ASCII-compatible objects, for the benefit of boundary code. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote: Unicode strings is not wrong but the technical emphasis on Unicode is as strange as a "tire car" or "rectangular door" when "car" and "door" are what you usually mean. The reason Unicode gets emphasised so much is that until relatively recently, it *wasn't* what "string" usually meant in Python. When Python 3 has been around for as long as Python 2 was, things may change. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Mon, Nov 24, 2014 at 7:31 AM, Dave Angel wrote: > On 11/23/2014 01:13 PM, random...@fastmail.us wrote: >> >> On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote: >>> >>> Why would that be possible? Many truetype fonts only supply >>> glyphs for >>> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows >>> character map utility and see what some of the font files contain. >> >> >> With a bitmap font selected, the characters will be immediately replaced >> with characters present in the font's codepage, and will copy to >> clipboard as such. > > > I didn't realize Windows shell (DOS box) had that bug. Course I don't use > Windows much the last few years. Likewise. I've been accustomed to copying and pasting unrecognized characters (one of the easiest solutions is to paste them into a Python console - ord() for one character, or a Py2 repr() for multiple - to quickly see what the codepoints are), relying on the clipboard getting the exact same sequence that was printed by the application. Thanks, Windows, just what I always wanted to hear. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
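The paste-and-inspect trick mentioned above might look like this in Python 3, where `ascii()` plays the role that `repr()` played in Python 2. The helper name `codepoints` and the sample string are assumptions made for illustration.

```python
def codepoints(text):
    """Return the code points of text in U+XXXX notation."""
    return ["U+{:04X}".format(ord(ch)) for ch in text]

# Stand-in for characters pasted from another application.
pasted = "\u0100\u20ac\U0001f300"

single = ord(pasted[0])       # one character: its code point as an int
escaped = ascii(pasted)       # several characters: escaped, ASCII-only repr
listing = codepoints(pasted)  # one U+XXXX entry per character

print(single)
print(escaped)
print(listing)
```

This only tells you what reached the interpreter, so it is useful precisely when the clipboard or console is suspected of substituting characters along the way.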
Re: python 2.7 and unicode (one more time)
On 11/23/2014 01:13 PM, random...@fastmail.us wrote: On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote: Why would that be possible? Many truetype fonts only supply glyphs for single-byte encodings (ISO-Latin-1, for example -- pop up the Windows character map utility and see what some of the font files contain. With a bitmap font selected, the characters will be immediately replaced with characters present in the font's codepage, and will copy to clipboard as such. I didn't realize Windows shell (DOS box) had that bug. Course I don't use Windows much the last few years. it's one thing to not display it properly. It's quite another to supply faulty data to the clipboard. Especially since the Windows clipboard has a separate Unicode type available. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote: > Why would that be possible? Many truetype fonts only supply glyphs for > single-byte encodings (ISO-Latin-1, for example -- pop up the Windows > character map utility and see what some of the font files contain. With a bitmap font selected, the characters will be immediately replaced with characters present in the font's codepage, and will copy to clipboard as such. With a truetype font (Lucida Console or Consolas) selected, the characters will be displayed as replacement glyphs (box with a question mark in it) if not present in the font, but *will still copy to the clipboard as the original code point* (which you might notice is where we started, with someone claiming success by being able to do so with codepage 65001 selected). And in any case, all characters that *are* in the font will work and display correctly, rather than only those in the OEM codepage. > Heck -- on my current machine, the True Type fonts are all old > third-party items. All the standard fonts are now Open Type. The win32 console's configuration UI refers to opentype fonts as truetype. Opentype fonts can use either truetype or type 1 as the underlying format, and all opentype fonts supplied with windows use truetype. You are being excessively pedantic in objecting to my use of the term "truetype". -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Mon, Nov 24, 2014 at 3:33 AM, Dennis Lee Bieber wrote:
> On Sat, 22 Nov 2014 20:52:37 -0500, random...@fastmail.us declaimed the
> following:
>
>>On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
>>> ...
>>> That is a standard Windows build. He is again conflating problems with
>>> using the Windows command line for a given code page with the FSR.
>>
>>The thing is, with a truetype font selected, a correctly written win32
>>console program should be able to print any character without caring
>
> Why would that be possible? Many truetype fonts only supply glyphs for
> single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
> character map utility and see what some of the font files contain).

A program should be able to print those characters even if they all look identical. Chances are you can copy and paste them into something else.

But yes, finding a suitable font that covers the whole Unicode range is *hard*. I've struggled with this one with a few programs (and I still haven't managed to get VLC to satisfactorily display subtitles that include Chinese characters).

ChrisA
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014 at 5:17 PM, Steven D'Aprano wrote:
> If Python treated the character set as an implementation detail, the
> programmer would have no way of knowing whether
>
> s = u"ö"
>
> is legal or not, since you cannot know whether or not ö is a supported
> character in the running Python. It might work on your system, and fail for
> other people. That is worse than the old distinction between "narrow"
> and "wide" builds. It would be a lazy and stupid design, and especially
> stupid since there really is no good alternative to Unicode today. ASCII is
> not even sufficient for American English, the whole Windows code page idea
> is a horrible mess, none of the legacy encodings are suitable for more than
> a tiny fraction of the world.

(Code pages aren't a Windows concept, of course, though I guess that's the main place where they're found on PCs today.)

The only trouble with enforcing Unicode is Japanese encodings and the whole Han unification debate. Ultimately, you have to pick a side: are you siding with those who say there are fewer characters with multiple forms, or with those who say there are more distinct characters? If the former, go with Unicode. If the latter, be prepared to do heaps of work yourself, and probably be stuck with supporting only Japanese, because encodings like Shift-JIS aren't going to be able to represent Scandinavian text.

Me, I'm siding with Unicode. The politicking of Han unification doesn't interest me, so I'm happy to accept a position that says that they're all the same character, just as the Roman letter A can be used in English, Italian, German, Swedish, etc, etc, etc (maybe with some combining characters for diacriticals). That gives me access to all the world's languages with a single character set and some trustworthy encodings. I think it's a fine trade-off: philosophy I don't care about versus correctness in my code.

ChrisA
Re: python 2.7 and unicode (one more time)
random...@fastmail.us wrote:

> On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
>> I really don't understand what bothers you about this. In Python, we have
>> Unicode strings and byte strings. In computing in general, strings can
>> consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
>> characters, ISO-8859-7 characters, and literally dozens of others. It
>> boggles my mind that you are so opposed to being explicit about what sort
>> of string we are dealing with.
>
> I think he means that it should be implementation-defined with an API
> that does not allow programs to make assumptions about the encoding,
> like C. To allow for implementations that use a different character set.

Python is not C, and doesn't make every second thing undefined behaviour.

If Python treated the character set as an implementation detail, the programmer would have no way of knowing whether

s = u"ö"

is legal or not, since you cannot know whether or not ö is a supported character in the running Python. It might work on your system, and fail for other people. That is worse than the old distinction between "narrow" and "wide" builds. It would be a lazy and stupid design, and especially stupid since there really is no good alternative to Unicode today. ASCII is not even sufficient for American English, the whole Windows code page idea is a horrible mess, none of the legacy encodings are suitable for more than a tiny fraction of the world.

--
Steven
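The point about legacy encodings covering only part of the world's text can be probed with a short script. This is a sketch for Python 3; the helper `can_encode` and the particular encodings tried are illustrative assumptions, not something proposed in the thread.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

char = "\u00f6"   # the character from the message above, written as an escape

# The str itself is always legal; only some encodings can represent it.
support = {enc: can_encode(char, enc) for enc in ("ascii", "latin-1", "utf-8")}

# A Cyrillic letter (U+0436) shows the same limit from the other direction.
cyrillic_in_latin1 = can_encode("\u0436", "latin-1")

print(support)
print(cyrillic_in_latin1)
```

The string literal is valid on every Python 3 installation; what varies is which byte encodings can carry it, and that choice is made explicitly at encode time.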
Re: python 2.7 and unicode (one more time)
On Sat, Nov 22, 2014, at 21:11, Chris Angelico wrote: > Is that true? Does WriteConsoleW support every Unicode character? It's > not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe > something else). I was defining "every unicode character" loosely. There are certainly display problems (there are display problems with wide characters on non-CJK windows versions, too), but if you write a surrogate pair, you'll get something that can copy to the clipboard as a surrogate pair, and get the same thing that writing a non-BMP UTF-8 character with codepage 65001 will give you. And you certainly won't get an error. -- https://mail.python.org/mailman/listinfo/python-list
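For readers unfamiliar with the term, a surrogate pair is how UTF-16 represents a code point outside the Basic Multilingual Plane. The sketch below (Python 3, with a hypothetical helper name `to_surrogate_pair`) computes the pair by hand and compares it with Python's own UTF-16 encoder; it does not call any Windows console API.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

cyclone = "\U0001f300"                 # a non-BMP character

pair = to_surrogate_pair(ord(cyclone))
utf16 = cyclone.encode("utf-16-be")    # two 16-bit code units, big-endian
utf8 = cyclone.encode("utf-8")         # the same character as four UTF-8 bytes

print([hex(unit) for unit in pair])
print(utf16.hex(), utf8.hex())
```

Both byte sequences describe the same single character, which is why writing the surrogate pair and writing the four-byte UTF-8 form are expected to be equivalent inputs.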
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014 at 12:52 PM, wrote:
> On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
>> ...
>> That is a standard Windows build. He is again conflating problems with
>> using the Windows command line for a given code page with the FSR.
>
> The thing is, with a truetype font selected, a correctly written win32
> console program should be able to print any character without caring
> about codepages (via use of WriteConsoleW instead of WriteFile). You
> cannot rely on having the codepage set to 65001, especially since 65001
> isn't actually a fully supported codepage.

Is that true? Does WriteConsoleW support every Unicode character? It's not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe something else).

ChrisA
Re: python 2.7 and unicode (one more time)
On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
> ...
> That is a standard Windows build. He is again conflating problems with
> using the Windows command line for a given code page with the FSR.

The thing is, with a truetype font selected, a correctly written win32 console program should be able to print any character without caring about codepages (via use of WriteConsoleW instead of WriteFile). You cannot rely on having the codepage set to 65001, especially since 65001 isn't actually a fully supported codepage.

In my opinion it is a deficiency in the win32 support, rather than unicode support (and certainly nothing to do with the FSR), but it _is_ a deficiency.
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
> I really don't understand what bothers you about this. In Python, we have
> Unicode strings and byte strings. In computing in general, strings can
> consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
> characters, ISO-8859-7 characters, and literally dozens of others. It
> boggles my mind that you are so opposed to being explicit about what sort
> of string we are dealing with.

I think he means that it should be implementation-defined with an API that does not allow programs to make assumptions about the encoding, like C. To allow for implementations that use a different character set.
Re: python 2.7 and unicode (one more time)
On 22/11/2014 22:31, Chris Angelico wrote: On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence wrote: My favourite "find thousand and one ways to make Python crashing or failing." but I don't recall a single bug report in the last two years from anybody regarding problems with the FSR, or have I missed something? What you've missed is the grammar of the sentence you've (partially) quoted. Clearly he is seeking to make Python, and he is crashing or failing. My advice to him: Stop trying to build complex software while in command of a car. ChrisA What? The entire message follows. I think you are not understanding the point very well. Py32 and Qt derivative + plenty of dirty tricks. (It will probably not be rendered correctly.) Write something like this (an interactive interpreter) in Py32 and Py33 and see what happens: >>> print(999) 999 >>> sys.version '3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]' >>> # note the emoji and the private use area (plane 15) >>> a = 'abc\u00e9\u0153\u20ac\u1e9e\U0001f300\udb80\udc00z' >>> print(a) abc需ẞ🌀z >>> Note: it can be "cut/copied/pasted" with a MS product. jmf PS I have to recognized, I'm slowly getting tired to find thousand and one ways to make Python crashing or failing. That is a standard Windows build. He is again conflating problems with using the Windows command line for a given code page with the FSR. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence wrote: > My favourite "find thousand and one ways to make Python crashing or > failing." but I don't recall a single bug report in the last two years from > anybody regarding problems with the FSR, or have I missed something? What you've missed is the grammar of the sentence you've (partially) quoted. Clearly he is seeking to make Python, and he is crashing or failing. My advice to him: Stop trying to build complex software while in command of a car. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On 22/11/2014 20:17, Chris Angelico wrote: On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence wrote: Please don't feed him. Your average troll is bad enough but he really takes the biscuit. ... someone was feeding him biscuits? ChrisA Surely it's better than feeding him unicode? As I needed cheering up I ventured over to gg and wasn't disappointed reading his latest rubbish. My favourite "find thousand and one ways to make Python crashing or failing." but I don't recall a single bug report in the last two years from anybody regarding problems with the FSR, or have I missed something? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence wrote: > Please don't feed him. Your average troll is bad enough but he really takes > the biscuit. ... someone was feeding him biscuits? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On 22/11/2014 17:49, Marko Rauhamaa wrote: wxjmfa...@gmail.com: - By chance, I found on the web a German py dev who was commenting and he had not an updated "DUDEN" (a German dictionnary). That... leaves me utterly speachless! Marko Please don't feed him. Your average troll is bad enough but he really takes the biscuit. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
wxjmfa...@gmail.com: > - By chance, I found on the web a German py dev who was commenting and > he had not an updated "DUDEN" (a German dictionnary). That... leaves me utterly speachless! Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Saturday, November 22, 2014 8:14:15 PM UTC+5:30, Roy Smith wrote: > Marko Rauhamaa wrote: > > > Steven D'Aprano: > > > > > You haven't given any good reason for objecting to calling Unicode > > > strings by what they are. Maybe you think that it is an implementation > > > detail, and that some version of Python might suddenly and without > > > warning change to only supporting KOI8-R strings or GB2312 strings? If > > > so, you are badly mistaken. The fact that Python strings are Unicode > > > is not an implementation detail, it is part of the language semantics. > > > > To me, repeating the word Unicode everywhere is giving the (in and of > > itself impressive) standard too primary a status. While understanding > > how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very > > useful and occasionally can be taken explicit advantage of, those really > > are mundane techniques to implement abstractions. > > > > Python's strings exist (primarily) so you can express utterances in a > > human language, aka plain text. They don't exist to express Unicode code > > points. That would be putting the cart before the horse. > > > > > "Rectangular door" makes perfect sense, and in a world where there are > > > dozens of legacy non-rectangular doors, it would be very sensible to > > > specify the kind of door. > > > > It makes sense, and yet, I've never heard anyone talk about rectangular > > doors even though I use numerous doors every day. Why is it, then, that > > people feel the constant need to add the "Unicode" epithet to Python's > > strings, which -- according to its own specification -- are just > > strings? > > > > > > Marko > > There's a old joke to the effect that the fields of study which are > confident that they're really doing science (i.e. chemistry, biology, > physics, astronomy, etc) don't put the word "science" in their names. 
> It's only the fields of study that are less confident about their status > as sciences (computer science, behavioral science, political science, > etc) that feel the need to explicitly say "science". As if repeating it > enough times makes it true. I wonder if something of the same thing > applies here? > > Somewhat more seriously, the IEEE-754 point is quite apropos. Back when > 754 first came out, there were lots of different floating point > implementations. Machines that used 754 touted it in their sales > literature and mentioned it all over their documentation. These days, > 754 is so ubiquitous, nobody even thinks to mention it, in the same way > nobody bothers to mention 2's complement integers. I suspect that some > day, the same thing will happen with Unicode. For that matter, we will > eventually get to the point where when people say, "just plain text", > they will mean Unicode, in the same way that "just plain text" today > really means ASCII (and the text/plain MIME type will become a > historical curiosity). Yes this was my point also -- encodings in general and unicode in particular is a mess (as of 2014). Maybe in a few years the dust will settle. Then saying 'unicode' will become redundant. But until then when we have a rather leaky abstraction having sealing liquid on the hands is preferable to sewage in the house. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Roy Smith : > For that matter, we will eventually get to the point where when people > say, "just plain text", they will mean Unicode, in the same way that > "just plain text" today really means ASCII (and the text/plain MIME > type will become a historical curiosity). MIME has: Content-Type: text/plain; charset="UTF-8" (even though UTF-8 isn't a character set but a content encoding). Marko -- https://mail.python.org/mailman/listinfo/python-list
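The header Marko quotes can be produced and inspected with the standard library's `email` package. This is a minimal sketch for Python 3; the message body is an illustrative assumption.

```python
from email.message import EmailMessage

body = "Gr\u00fc\u00dfe"   # non-ASCII sample text

msg = EmailMessage()
# The charset parameter names the encoding used for the text/plain body.
msg.set_content(body, charset="utf-8")

content_type = msg.get_content_type()     # media type without parameters
charset = msg.get_content_charset()       # value of the charset parameter
header = str(msg["Content-Type"])         # the full header value

print(header)
print(content_type, charset)
```

The media type stays `text/plain`; the encoding travels as a parameter next to it, which is the detail the message is drawing attention to.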
Re: python 2.7 and unicode (one more time)
In article <87y4r348uf@elektro.pacujo.net>, Marko Rauhamaa wrote:

> Steven D'Aprano :
>
> > You haven't given any good reason for objecting to calling Unicode
> > strings by what they are. Maybe you think that it is an implementation
> > detail, and that some version of Python might suddenly and without
> > warning change to only supporting KOI8-R strings or GB2312 strings? If
> > so, you are badly mistaken. The fact that Python strings are Unicode
> > is not an implementation detail, it is part of the language semantics.
>
> To me, repeating the word Unicode everywhere is giving the (in and of
> itself impressive) standard too primary a status. While understanding
> how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
> useful and occasionally can be taken explicit advantage of, those really
> are mundane techniques to implement abstractions.
>
> Python's strings exist (primarily) so you can express utterances in a
> human language, aka plain text. They don't exist to express Unicode code
> points. That would be putting the cart before the horse.
>
> > "Rectangular door" makes perfect sense, and in a world where there are
> > dozens of legacy non-rectangular doors, it would be very sensible to
> > specify the kind of door.
>
> It makes sense, and yet, I've never heard anyone talk about rectangular
> doors even though I use numerous doors every day. Why is it, then, that
> people feel the constant need to add the "Unicode" epithet to Python's
> strings, which -- according to its own specification -- are just
> strings?
>
>
> Marko

There's an old joke to the effect that the fields of study which are confident that they're really doing science (i.e. chemistry, biology, physics, astronomy, etc) don't put the word "science" in their names. It's only the fields of study that are less confident about their status as sciences (computer science, behavioral science, political science, etc) that feel the need to explicitly say "science". As if repeating it enough times makes it true. I wonder if something of the same thing applies here?

Somewhat more seriously, the IEEE-754 point is quite apropos. Back when 754 first came out, there were lots of different floating point implementations. Machines that used 754 touted it in their sales literature and mentioned it all over their documentation. These days, 754 is so ubiquitous, nobody even thinks to mention it, in the same way nobody bothers to mention 2's complement integers. I suspect that some day, the same thing will happen with Unicode. For that matter, we will eventually get to the point where when people say, "just plain text", they will mean Unicode, in the same way that "just plain text" today really means ASCII (and the text/plain MIME type will become a historical curiosity).
Re: python 2.7 and unicode (one more time)
Steven D'Aprano : > You haven't given any good reason for objecting to calling Unicode > strings by what they are. Maybe you think that it is an implementation > detail, and that some version of Python might suddenly and without > warning change to only supporting KOI8-R strings or GB2312 strings? If > so, you are badly mistaken. The fact that Python strings are Unicode > is not an implementation detail, it is part of the language semantics. To me, repeating the word Unicode everywhere is giving the (in and of itself impressive) standard too primary a status. While understanding how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very useful and occasionally can be taken explicit advantage of, those really are mundane techniques to implement abstractions. Python's strings exist (primarily) so you can express utterances in a human language, aka plain text. They don't exist to express Unicode code points. That would be putting the cart before the horse. > "Rectangular door" makes perfect sense, and in a world where there are > dozens of legacy non-rectangular doors, it would be very sensible to > specify the kind of door. It makes sense, and yet, I've never heard anyone talk about rectangular doors even though I use numerous doors every day. Why is it, then, that people feel the constant need to add the "Unicode" epithet to Python's strings, which -- according to its own specification -- are just strings? Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Sun, Nov 23, 2014 at 12:50 AM, Steven D'Aprano wrote: > "Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a > world where there are dozens of legacy non-rectangular doors, it would be > very sensible to specify the kind of door. Just as we specify sliding door, > glass door, security door, fire door, flyscreen wire door, and so on. Not just legacy - scifi often has non-rectangular doors. (And they're often HORRIBLY impractical. I think the rectangular door is here to stay.) But English is a strange beast. A glass door is made of glass... a flyscreen wire door is made of (at least, has a significant component of) flyscreen, but a fire door isn't made of fire... ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote:

> Steven D'Aprano :
>
>> In Python, we have Unicode strings and byte strings.
>
> No, you don't. You have strings and bytes:

Python has strings of Unicode code points, a.k.a. "Unicode strings", or "text strings", and strings of bytes, a.k.a. "byte strings". These are the plain English descriptive names of the types "str" and "bytes".

> Textual data in Python is handled with str objects, or strings.
> Strings are immutable sequences of Unicode code points. String
> literals are written in a variety of ways: [...]

Hence, Unicode string.

> <https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>
>
> The core built-in types for manipulating binary data are bytes and
> bytearray.

Which are strings of bytes.

> <https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>
>
>
> Equivalently, I wouldn't mind "character strings" vs "byte strings".

Unicode strings are not strings of characters, except informally. Some code points represent non-characters:

http://www.unicode.org/faq/private_use.html#nonchar1

They are strings of Unicode code points, but "code point string" is firstly an inelegant and ugly phrase, and secondly ambiguous. What sort of code points? Baudot codes? ASCII codes? Big5 codes? Tron codes? No, none of the above, they are *Unicode* code points.

You haven't given any good reason for objecting to calling Unicode strings by what they are. Maybe you think that it is an implementation detail, and that some version of Python might suddenly and without warning change to only supporting KOI8-R strings or GB2312 strings? If so, you are badly mistaken. The fact that Python strings are Unicode is not an implementation detail, it is part of the language semantics.

> Unicode strings is not wrong but the technical emphasis on Unicode is as
> strange as a "tire car" or "rectangular door" when "car" and "door" are
> what you usually mean.

"Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a world where there are dozens of legacy non-rectangular doors, it would be very sensible to specify the kind of door. Just as we specify sliding door, glass door, security door, fire door, flyscreen wire door, and so on.

-- 
Steven
Re: python 2.7 and unicode (one more time)
Steven D'Aprano :

> In Python, we have Unicode strings and byte strings.

No, you don't. You have strings and bytes:

    Textual data in Python is handled with str objects, or strings.
    Strings are immutable sequences of Unicode code points. String
    literals are written in a variety of ways: [...]

    <https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str>

    The core built-in types for manipulating binary data are bytes and
    bytearray.

    <https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview>

Equivalently, I wouldn't mind "character strings" vs "byte strings".

Unicode strings is not wrong but the technical emphasis on Unicode is as strange as a "tire car" or "rectangular door" when "car" and "door" are what you usually mean.

Marko
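The two types described in the documentation quoted above can be compared directly. This is only a minimal sketch, assuming Python 3; the sample word and the variable names are illustrative and not taken from the thread.

```python
# Python 3: str is a sequence of Unicode code points,
# bytes is a sequence of integers in range(256).
text = 'caf\N{LATIN SMALL LETTER E WITH ACUTE}'   # 4 code points
data = text.encode('utf-8')                       # 5 bytes: the accented e takes two

text_len = len(text)
data_len = len(data)

# Indexing shows that the element types differ.
first_char = text[0]   # a length-1 str
first_byte = data[0]   # an int

# Decoding with the matching codec recovers the original text.
round_trip = data.decode('utf-8')
```

Whatever name one prefers for them, the two types do not mix implicitly in Python 3: going from one to the other always requires an explicit encode or decode step.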
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote:

> Rustom Mody :
>
>> Likewise in 2014, and given the arguments, inconsistencies, etc
>> remembering the nuts-n-bolts below the strings-represented-as-unicode
>> abstraction may be in order.
>
> No need to hide Unicode, but talking about a
>
>    Unicode string
>
> is like talking about an
>
>    electronic computer

versus a hydraulic computer, a mechanical computer, an optical computer, a human computer, a genetic (DNA) computer, ...

>    visible spectrum display

I'm not sure that many people actually do refer to "visible spectrum display", or what you mean by it, but I can easily imagine that being in contrast with a non-visible spectrum display.

>    mouse user interface

As opposed to a commandline user interface, direct brain-to-computer user interface, touch UI, etc. Not to mention non-user interfaces, like SCSI interface, SATA interface, USB interface, ...

>    ethernet socket

Telephone socket, Appletalk socket, Firewire socket, ADB socket ...

>    magnetic file

I have no idea what you mean here. Do you mean magnetic *field*? As opposed to an electric field, gravitational field, Higgs field, strong nuclear force field, weak nuclear force field ...

>    electric power supply
>
> The language spec calls the things just "strings," as it should.

I really don't understand what bothers you about this. In Python, we have Unicode strings and byte strings. In computing in general, strings can consist of Unicode characters, ASCII characters, Tron characters, EBCDIC characters, ISO-8859-7 characters, and literally dozens of others. It boggles my mind that you are so opposed to being explicit about what sort of string we are dealing with. Are you equally disturbed when people distinguish between tablespoon, teaspoon, dessert spoon and serving spoon?

-- 
Steven
Re: python 2.7 and unicode (one more time)
On Sat, Nov 22, 2014 at 3:36 AM, Marko Rauhamaa wrote: > No need to hide Unicode, but talking about a > >Unicode string > > is like talking about an > >electronic computer > >visible spectrum display > >mouse user interface > >ethernet socket > >magnetic file > >electric power supply > > The language spec calls the things just "strings," as it should. I'm not sure what you mean here, because the adjectives all cut out other common constructs - a byte string, an analog computer, an IR or UV display, a blind-compatible UI, a Unix domain socket, an in-memory file, and a diesel power supply. Okay, I'm pushing it with the last one (they're usually called gen sets, not power supplies), and I don't often hear people talk about "magnetic files", but the rest are definitely valid comparison/contrast terms. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Rustom Mody : > Likewise in 2014, and given the arguments, inconsistencies, etc > remembering the nuts-n-bolts below the strings-represented-as-unicode > abstraction may be in order. No need to hide Unicode, but talking about a Unicode string is like talking about an electronic computer visible spectrum display mouse user interface ethernet socket magnetic file electric power supply The language spec calls the things just "strings," as it should. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Sat, Nov 22, 2014 at 3:11 AM, Francis Moreau wrote:
> Yes I finally used str() since only setlocale() reported to have some
> issues with unicode_literals active in my application.
>
> Thanks Chris for your useful insight.

My pleasure. Unicode is a bit of a hobby-horse of mine, so I'm always happy to see people getting things right :)

ChrisA
Re: python 2.7 and unicode (one more time)
On 11/20/2014 04:15 PM, Chris Angelico wrote: > On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau > wrote: >> Hi, >> >> Thanks for the "from __future__ import unicode_literals" trick, it makes >> that switch much less intrusive. >> >> However it seems that I will suddenly be trapped by all modules which >> are not prepared to handle unicode. For example: >> >> >>> from __future__ import unicode_literals >> >>> import locale >> >>> locale.setlocale(locale.LC_ALL, 'fr_FR') >> Traceback (most recent call last): >>File "", line 1, in >>File "/usr/lib64/python2.7/locale.py", line 546, in setlocale >> locale = normalize(_build_localename(locale)) >>File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename >> language, encoding = localetuple >> ValueError: too many values to unpack >> >> Is the locale module an exception and in that case I'll fix it by doing: >> >> >>> locale.setlocale(locale.LC_ALL, b'fr_FR') >> >> or is a (big) part of the modules in python 2.7 still not ready for >> unicode and in that case I have to decide which prefix (u or b) I should >> manually add ? > > Sadly, there are quite a lot of parts of Python 2 that simply don't > handle Unicode strings. But you can probably keep all of those down to > just a handful of explicit b"whatever" strings; most places should > accept unicode as well as str. What you're seeing here is a prime > example of one of this author's points (caution, long post): > > http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/ > > """The lesson of Python 3 is: give programmers a Unicode string type, > *make it the default*, and encoding issues will /mostly/ go away.""" > > There's a whole ecosystem to Python 2 - some in the standard library, > heaps more in the rest of the world - and a lot of it was written on > the assumption that a byte is a character is an octet. 
When you pass > Unicode strings to functions written to expect byte strings, sometimes > you win, and sometimes you lose... even with the standard library > itself. But the Python 3 ecosystem has been written on the assumption > that strings are Unicode. It's only a narrow set of programs > ("boundary code", where you're moving text across networks and stuff > like that) where the Python 2 model is easier to work with; and the > recent Py3 releases have been progressively working to relieve that > pain. > > The absolute worst case is a function which exists in Python 2 and 3, > and requires a byte string in Py2 and a text string in Py3. Sadly, > that may be exactly what locale.setlocale() is. For that, I would > suggest explicitly passing stuff through str(): > > locale.setlocale(locale.LC_ALL, str('fr_FR')) > > In Python 3, 'fr_FR' is already a str, so passing it through str() > will have no significant effect. (Though it would be worth commenting > that, to make it clear to a subsequent reader that this is Py2 compat > code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode, > so passing it through str() will encode it to ASCII, producing a byte > string that setlocale should be happy with. 
>
> By the way, the reason for the strange error message is clearer in
> Python 3, which chains in another exception:
>
> locale.setlocale(locale.LC_ALL, b'fr_FR')
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
>     language, encoding = localetuple
> ValueError: too many values to unpack (expected 2)
>
> During handling of the above exception, another exception occurred:
>
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
>     locale = normalize(_build_localename(locale))
>   File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
>     raise TypeError('Locale must be None, a string, or an iterable of
> two strings -- language code, encoding.')
> TypeError: Locale must be None, a string, or an iterable of two
> strings -- language code, encoding.
>
> So when it gets the wrong type of string, it attempts to unpack it as
> an iterable; it yields five values (the five bytes or characters,
> depending on which way it's the wrong type of string), but it's
> expecting two. Fortunately, str() will deal with this. But make sure
> you don't have the b prefix, or str() in Py3 will give you quite a
> different result!
>

Yes I finally used str() since only setlocale() reported to have some issues with unicode_literals active in my application.

Thanks Chris for your useful insight.
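The str() workaround and the b-prefix pitfall discussed in this message can be sketched as follows. This is a hedged illustration, not the poster's actual code: it assumes Python 3 for the pitfall, and it uses the 'C' locale rather than 'fr_FR' on the assumption that 'C' is available on any system.

```python
import locale

# On Python 2 with unicode_literals active, 'C' would be a unicode object,
# and str() would turn it into the native byte string that setlocale expects.
# On Python 3 the literal is already a str, so str() changes nothing.
# ('C' is an assumption made for portability; the thread used 'fr_FR'.)
locale_name = str('C')
current = locale.setlocale(locale.LC_ALL, locale_name)

# The pitfall: on Python 3, str() of a bytes object returns its repr
# instead of decoding it, so the b prefix ends up inside the string.
wrong = str(b'fr_FR')
```

A short comment next to the str() call, as suggested in the quoted advice, makes it clear to later readers that the wrapper exists only for Python 2 compatibility.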
Re: python 2.7 and unicode (one more time)
On Friday, November 21, 2014 12:06:54 PM UTC+5:30, Marko Rauhamaa wrote:
> Chris Angelico :
>
> > On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa wrote:
> >> I don't really like it how Unicode is equated with text, or even
> >> character strings.
> > [...]
> > Do you have actual text that you're unable to represent in Unicode?
>
> Not my point at all.
>
> I'm saying equating an abstract data type (string) with its
> representation (Unicode vector) is bad taste.
>
> > We don't call numbers IEEE,
>
> Exactly.
>
> > Do you genuinely have text that you can't represent in Unicode, or are
> > you just arguing against Unicode to try to justify "Python strings are
> > " as a basis for your code?
>
> Nobody is arguing against Unicode. I'm saying, let's talk about the
> forest instead of the trees (except when the trees really are the
> focus).

I've always felt the makers of C showed remarkably good taste in the names 'int' and 'float'. Unlike:

Pascal: Int and Real
PL/1: Fixed and Float

IOW, the name is an explicit reminder of the leakier abstraction used for real numbers.

Likewise in 2014, and given the arguments, inconsistencies, etc remembering the nuts-n-bolts below the strings-represented-as-unicode abstraction may be in order.
Re: python 2.7 and unicode (one more time)
On 2014-11-22 02:23, Steven D'Aprano wrote:
> LATIN SMALL LETTER E
> COMBINING CIRCUMFLEX ACCENT
>
> then my application should treat that as a single "character" and
> display it as:
>
> LATIN SMALL LETTER E WITH CIRCUMFLEX
>
> which looks like this: ê
>
> rather than two distinct "characters" eˆ
>
> Now, that specific example is a no-brainer, because the Unicode
> normalization routines will handle the conversion. But not every
> combination of accented characters has a canonical combined form.
> What about something like this?
>
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
>
> If I insert a character into my string, I want to be able to insert
> before the w or after the caron, but not in the middle of those
> three code points.

Things get even weirder if you have

  '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}\N{COMBINING CARON}'

and when you try to do comparisons like

  s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
  s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
  s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'
  print(s1 == s2)
  print(s1 == s3)
  print(s2 == s3)

Then you also have the case where you want to edit text and the user wants to remove the COMBINING OGONEK from the character, so you *do* want to do something akin to

  s4 = ''.join(c for c in s3 if c != '\N{COMBINING OGONEK}')

And yet, weird things happen if you try to remove the circumflex:

  for test in (s1, s2, s3):
      print(test == ''.join(
          c for c in test
          if c != '\N{COMBINING CIRCUMFLEX ACCENT}'
      ))

They all make sense if you understand what's going on under the hood, but from a visual/conceptual perspective, something feels amiss.

-tkc
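One way to make the three spellings in this message compare equal is to normalize them first. The sketch below assumes Python 3 and the standard unicodedata module; the helper names nfd and strip_mark are hypothetical, introduced only for this illustration.

```python
import unicodedata

s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'

# Compared code point by code point, the three strings all differ.
raw_equal = (s1 == s2, s1 == s3, s2 == s3)


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize('NFD', s)


# After canonical decomposition and reordering they are identical.
normalized_equal = (nfd(s1) == nfd(s2), nfd(s1) == nfd(s3), nfd(s2) == nfd(s3))


def strip_mark(s, mark):
    """Remove one combining mark, decomposing first so it is always visible."""
    return ''.join(c for c in nfd(s) if c != mark)


# Removing the circumflex now gives the same result for all three inputs.
stripped = {strip_mark(s, '\N{COMBINING CIRCUMFLEX ACCENT}') for s in (s1, s2, s3)}
```

Decomposing before filtering is what keeps the precomposed form in s1 from hiding the circumflex inside a single code point.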
Re: python 2.7 and unicode (one more time)
On Sat, Nov 22, 2014 at 2:23 AM, Steven D'Aprano wrote: > Chris Angelico wrote: > >> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano >> wrote: >>> (E.g. there are millions of existing files across the world containing >>> text which use legacy encodings that are not compatible with Unicode.) >> >> Not compatible with Unicode? There aren't many character sets out >> there that include characters not in Unicode - that was the whole >> point. Of course, there are plenty of files in unspecified eight-bit >> encodings, so you may have a problem with reliable decoding - but if >> you know what the encoding is, you ought to be able to represent each >> character in Unicode. > > What I meant was that some encodings -- namely ASCII and Latin-1 -- the > ordinals are exactly equivalent to Unicode, that is: > > That's not quite as significant as I thought, though. What is significant is > that a pure ASCII file on disk can be read by a program assuming UTF-8: > > although the same is not the case for Latin-1 encoded files. Yep. Thing is, Unicode can't magically convert all files on all disks... but with a good codec library, you can at least convert things as you find them. (I was reading MacRoman files earlier this year. THAT is an encoding I didn't expect I'd find in 2014.) > Well, yes. My point, agreeing with Marko, is that any time you want to do > something even vaguely related to human-readable text, "code points" are > not enough. ... What about something like this? > > 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}' > > If I insert a character into my string, I want to be able to insert before > the w or after the caron, but not in the middle of those three code points. Yes, which is a concern. Also a concern is the ability to detect other boundaries, like words. None of these can be easily solved; all of them can be dealt with by using the Unicode character data, which is better than you get for most legacy encodings. 
In terms of Python strings, it still makes sense to insert characters between those combining characters; so what you're saying is that a text editor widget needs to be aware of more than just code points. Which is trivially obvious in the presence of RTL text, too; cursor positions through differing-direction text will be an issue. The problems you're citing aren't Unicode problems. They stem from the complexities of human languages. Unicode just makes them a bit more visible to English-only-speaking programmers. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
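The Unicode character data mentioned in this reply is exposed in Python through the unicodedata module. As a small sketch, assuming Python 3, the following looks up the bidirectional class of each code point in a mixed-direction string; the sample string is illustrative only.

```python
import unicodedata

# Latin letters, a space, then two Hebrew letters (right-to-left script).
mixed = 'abc \N{HEBREW LETTER ALEF}\N{HEBREW LETTER BET}'

# 'L' is left-to-right, 'R' is right-to-left, 'WS' is whitespace.
classes = [unicodedata.bidirectional(c) for c in mixed]
```

A real editor widget would need the full bidirectional algorithm on top of these per-character properties; the lookup only shows where that information comes from.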
Re: python 2.7 and unicode (one more time)
Chris Angelico wrote:

> On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
> wrote:
>> (E.g. there are millions of existing files across the world containing
>> text which use legacy encodings that are not compatible with Unicode.)
>
> Not compatible with Unicode? There aren't many character sets out
> there that include characters not in Unicode - that was the whole
> point. Of course, there are plenty of files in unspecified eight-bit
> encodings, so you may have a problem with reliable decoding - but if
> you know what the encoding is, you ought to be able to represent each
> character in Unicode.

What I meant was that for some encodings -- namely ASCII and Latin-1 -- the ordinals are exactly equivalent to Unicode, that is:

    # Python 3
    for i in range(128):
        assert chr(i).encode('ASCII') == bytes([i])
    for i in range(256):
        assert chr(i).encode('Latin-1') == bytes([i])

That's not quite as significant as I thought, though. What is significant is that a pure ASCII file on disk can be read by a program assuming UTF-8:

    for i in range(128):
        assert chr(i).encode('UTF-8') == bytes([i])

although the same is not the case for Latin-1 encoded files.

> Not compatible with any of the UTFs, that's different. Plenty of that
> in the world.
>
>> You are certainly correct that in its full generality, "text" is much
>> more than just a string of code points. Unicode strings is a primitive
>> data type. A powerful and sophisticated text processing application may
>> even find Python strings too primitive, possibly needing something like
>> ropes of graphemes rather than strings of code points.
>
> That's probably more an efficiency point, though. It should be
> possible to do a perfect two-way translation between your grapheme
> rope and a Python string; otherwise, you'll have great difficulty
> saving your file to the disk (which will normally involve representing
> the text in Unicode, then encoding that to bytes).

Well, yes.
My point, agreeing with Marko, is that any time you want to do something even vaguely related to human-readable text, "code points" are not enough. For example, if I give a string containing the following two code points in this order: LATIN SMALL LETTER E COMBINING CIRCUMFLEX ACCENT then my application should treat that as a single "character" and display it as: LATIN SMALL LETTER E WITH CIRCUMFLEX which looks like this: ê rather than two distinct "characters" eˆ Now, that specific example is a no-brainer, because the Unicode normalization routines will handle the conversion. But not every combination of accented characters has a canonical combined form. What about something like this? 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}' If I insert a character into my string, I want to be able to insert before the w or after the caron, but not in the middle of those three code points. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
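The requirement described here, that insertion points fall only between a base character and the next base character, can be approximated with the combining class from unicodedata. This is a rough sketch assuming Python 3; the function name clusters is hypothetical, and the grouping is much simpler than the grapheme cluster rules of the Unicode text segmentation annex.

```python
import unicodedata


def clusters(text):
    """Group each code point with the combining marks that follow it.

    A simplification: only code points with a non-zero canonical
    combining class are attached to the preceding group.
    """
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to the current group
        else:
            groups.append(ch)     # start a new group
    return groups


w_stack = 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
parts = clusters(w_stack + 'x')

# The simple case composes to one code point; the stacked w does not.
composed_e = unicodedata.normalize('NFC', 'e\N{COMBINING CIRCUMFLEX ACCENT}')
composed_w = unicodedata.normalize('NFC', w_stack)
```

The boundaries between the returned groups are the positions where an editor could safely insert text, which for the example means before the w or after the caron.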
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 7:16 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> Then you need to read more about Unicode. The *codepoint* for the >> letter 'A' is 65. That is not Unicode, that is one part of the Unicode >> spec. > > I don't think Python users need to know anything more about Unicode than > they need to know about IEEE-754. > > How many bits are reserved for the mantissa? I don't remember and I > don't see why I should care. At what point can a Python float no longer represent every integer? That's why you should care. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
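The question posed here has a concrete answer that can be checked at the interpreter. A minimal sketch, assuming the platform uses IEEE-754 double precision for Python floats (which is what sys.float_info reports on common platforms):

```python
import sys

# Number of bits in the significand (mantissa), including the implicit bit.
mantissa_bits = sys.float_info.mant_dig
limit = 2 ** mantissa_bits

# Every integer up to the limit converts to float exactly ...
exact_below = float(limit - 1) == limit - 1

# ... but limit + 1 cannot be represented and rounds back down to limit.
collides = float(limit + 1) == float(limit)
```

With 53 bits, the first integer that a float cannot represent is 2**53 + 1.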
Re: python 2.7 and unicode (one more time)
Chris Angelico : > Then you need to read more about Unicode. The *codepoint* for the > letter 'A' is 65. That is not Unicode, that is one part of the Unicode > spec. I don't think Python users need to know anything more about Unicode than they need to know about IEEE-754. How many bits are reserved for the mantissa? I don't remember and I don't see why I should care. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 6:14 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa wrote: >>> I'm saying equating an abstract data type (string) with its >>> representation (Unicode vector) is bad taste. >> >> What about "sequence of Unicode code points" is "representation"? What >> is your abstraction over that? > > The letter 'A' is a character. Unicode for the letter 'A' is 65. It is > very rarely that you care about that number. You are only interested in > the letter 'A', which you can use to spell people's names, for instance. > > When you read a book, you read the text, not the ink. Then you need to read more about Unicode. The *codepoint* for the letter 'A' is 65. That is not Unicode, that is one part of the Unicode spec. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
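The distinction drawn here between a code point and the rest of the Unicode standard can be illustrated with a few standard calls. This is only a sketch, assuming Python 3; the variable names are illustrative.

```python
import unicodedata

ch = 'A'

code_point = ord(ch)              # the code point, an abstract number
name = unicodedata.name(ch)       # a property from the character database

# The same code point has different byte representations per encoding.
utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-le')
utf32 = ch.encode('utf-32-le')
```

The number 65 is the same in every case; only the encoded bytes differ, which is why the code point and its encodings are separate parts of the specification.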
Re: python 2.7 and unicode (one more time)
Chris Angelico : > On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa wrote: >> I'm saying equating an abstract data type (string) with its >> representation (Unicode vector) is bad taste. > > What about "sequence of Unicode code points" is "representation"? What > is your abstraction over that? The letter 'A' is a character. Unicode for the letter 'A' is 65. It is very rarely that you care about that number. You are only interested in the letter 'A', which you can use to spell people's names, for instance. When you read a book, you read the text, not the ink. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa wrote: > Chris Angelico : > >> On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa wrote: >>> I don't really like it how Unicode is equated with text, or even >>> character strings. >> [...] >> Do you have actual text that you're unable to represent in Unicode? > > Not my point at all. > > I'm saying equating an abstract data type (string) with its > representation (Unicode vector) is bad taste. What about "sequence of Unicode code points" is "representation"? What is your abstraction over that? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Chris Angelico : > On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa wrote: >> I don't really like it how Unicode is equated with text, or even >> character strings. > [...] > Do you have actual text that you're unable to represent in Unicode? Not my point at all. I'm saying equating an abstract data type (string) with its representation (Unicode vector) is bad taste. > We don't call numbers IEEE, Exactly. > Do you genuinely have text that you can't represent in Unicode, or are > you just arguing against Unicode to try to justify "Python strings are > " as a basis for your code? Nobody is arguing against Unicode. I'm saying, let's talk about the forest instead of the trees (except when the trees really are the focus). Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 12:31 PM, wrote: > On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote: >> 2) Languages which use a different alphabet (eg Cyrillic - Russian, >> Bulgarian). You could possibly cram them into an eight-bit encoding >> without tipping ASCII out, but I'm not sure. In Unicode, these >> languages are all easily supported by the BMP, as they don't use a >> huge number of characters each. > > There are numerous eight-bit encodings that support latin and one other > alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit > encoding is basically two seven-bit encodings. I'm aware of this; Greek, for instance, fits quite happily into ISO-8859-7, which is eight-bit. > The most difficult (of those still possible at all) language to encode > in eight bits is actually Vietnamese, which uses the Latin alphabet, due > to the sheer number of accented letters used. Windows' encoding of it > (along with some other lesser used encodings, all for Vietnamese) is the > only 8-bit encoding to use combining accents, in a way unfortunately > incompatible with unicode normalization if naively translated, whereas > VISCII sacrifices a handful of C0 control characters in addition to > fully packing the high half with letters. This is what I was suspicious of. The very notion of "combining accents" already breaks the notion that "a byte is a character is a glyph", which most eight-bit encodings try to pretend. In any case, the BMP still easily copes with them all. (Hmm. I wonder how you'd typeset the old "Self-Pronouncing Alphabet" for English? It's basically English text with a few markings added to letters - not standard diacriticals that already exist in Unicode, but dots. Probably possible, one way or another... but I haven't seen SPA text since the 90s, and that was in stuff published back in the 80s or so.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote: > 2) Languages which use a different alphabet (eg Cyrillic - Russian, > Bulgarian). You could possibly cram them into an eight-bit encoding > without tipping ASCII out, but I'm not sure. In Unicode, these > languages are all easily supported by the BMP, as they don't use a > huge number of characters each. There are numerous eight-bit encodings that support latin and one other alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit encoding is basically two seven-bit encodings. The most difficult (of those still possible at all) language to encode in eight bits is actually Vietnamese, which uses the Latin alphabet, due to the sheer number of accented letters used. Windows' encoding of it (along with some other lesser used encodings, all for Vietnamese) is the only 8-bit encoding to use combining accents, in a way unfortunately incompatible with unicode normalization if naively translated, whereas VISCII sacrifices a handful of C0 control characters in addition to fully packing the high half with letters. -- Random832 -- https://mail.python.org/mailman/listinfo/python-list
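The points about eight-bit encodings can be tried with the codecs that ship with Python. The sketch below assumes Python 3 and its standard 'iso-8859-7', 'latin-1' and 'cp1258' codecs; the helper encodable is hypothetical, and the Vietnamese example is one possible reading of the normalization problem described above, not a full account of it.

```python
import unicodedata


def encodable(text, codec):
    """Return True if text can be encoded with the named codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Greek fits in an eight-bit encoding designed for it, but not in Latin-1.
alpha = '\N{GREEK SMALL LETTER ALPHA}'
greek_bytes = alpha.encode('iso-8859-7')
latin1_ok = encodable(alpha, 'latin-1')

# Windows-1258 spells this letter as a precomposed e-circumflex
# followed by a combining acute accent: two bytes for one letter.
mixed_form = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING ACUTE ACCENT}'
as_written = mixed_form.encode('cp1258')

# Neither standard normalization form of the same text is encodable.
nfc_ok = encodable(unicodedata.normalize('NFC', mixed_form), 'cp1258')
nfd_ok = encodable(unicodedata.normalize('NFD', mixed_form), 'cp1258')
```

The half-composed spelling is neither NFC nor NFD, so text that has been normalized needs an extra conversion step before it can be written in that encoding.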
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano wrote:
> (E.g. there are millions of existing files across the world containing text
> which use legacy encodings that are not compatible with Unicode.)

Not compatible with Unicode? There aren't many character sets out there that include characters not in Unicode - that was the whole point. Of course, there are plenty of files in unspecified eight-bit encodings, so you may have a problem with reliable decoding - but if you know what the encoding is, you ought to be able to represent each character in Unicode.

Not compatible with any of the UTFs, that's different. Plenty of that in the world.

> You are certainly correct that in its full generality, "text" is much more
> than just a string of code points. Unicode strings is a primitive data
> type. A powerful and sophisticated text processing application may even
> find Python strings too primitive, possibly needing something like ropes of
> graphemes rather than strings of code points.

That's probably more an efficiency point, though. It should be possible to do a perfect two-way translation between your grapheme rope and a Python string; otherwise, you'll have great difficulty saving your file to the disk (which will normally involve representing the text in Unicode, then encoding that to bytes).

To be sure, a Python string is a poor representational form for a text editor. But that's largely because it's immutable, so every little edit would involve massive copying. Depending on what you're doing, it might be worth using a chunked UTF-8 byte stream (allowing for insertion at any chunk boundary), or an array of lines, or something grapheme-based... but all of those questions are performance, not correctness, issues.

> We Western and Northern European speakers -- and I don't know whether Finns
> are counted as Northern Europeans or Eastern Europeans -- are lucky in that
> our natural languages are well-covered by Unicode.
All our graphemes are > also code points, even the "funny ones with accents". As an English > speaker. I have to remind myself that not every grapheme is a single code > point, but Devanagari or Navajo writers will never make that mistake. I've been working with different languages a bit, lately. Broadly speaking, you have: 1) Languages which use the Roman alphabet, plus a handful of other characters (eg Finnish, German). These can be represented largely in ASCII, and used to be handled fairly easily with a single codepage - an eight-bit ASCII-compatible encoding. 2) Languages which use a different alphabet (eg Cyrillic - Russian, Bulgarian). You could possibly cram them into an eight-bit encoding without tipping ASCII out, but I'm not sure. In Unicode, these languages are all easily supported by the BMP, as they don't use a huge number of characters each. 3) Languages which use a non-alphabetic system (eg Korean). I think they're all still covered by the BMP, but there's no way you can fit them into eight-bit encodings - one single language will use more than 256 symbols. 4) Ancient, esoteric, or symbolic writing systems. Not fundamentally different from the above categories except that they're less used, and the BMP has finite space. These will definitely need the SMP. But all of them are covered by Unicode. (Sadly, they are NOT all covered by all fonts, so I've been finding that certain pieces of text come out as strings of little boxes. But I can at least manipulate the text, even if I can't read it back.) I can, for example, zip lines of text like this: English: Let it go, let it go! I am one with the wind and sky Let it go, let it go! You'll never see me cry! Icelandic: Þetta er nóg, þetta er nóg Uppi í himni eins og vindablær Þetta er nóg, komið nóg Og tár mín enginn sér fær Russian: Отпусти и забудь, Этот мир из твоих грёз. Отпусти и забудь, И не будет больше слёз. Output: Let it go, let it go! 
Þetta er nóg, þetta er nóg Отпусти и забудь, I am one with the wind and sky Uppi í himni eins og vindablær Этот мир из твоих грёз. Let it go, let it go! Þetta er nóg, komið nóg Отпусти и забудь, You'll never see me cry! Og tár mín enginn sér fær И не будет больше слёз. In fact, it's trivially easy to write something like this, because all this text is Unicode. ALL of these languages (and plenty more) are "well-covered by Unicode". There's still the ongoing debate of Han unification, plus the progressive work of adding characters for ancient scripts and such, but AFAIK, all writing systems currently in use are covered. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
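The message does not show the code behind that output, so the following is only a guess at how such interleaving might be written in Python 3. The variable names are invented here; the lines of text are the ones quoted in the message.

```python
# Interleave three translations line by line; all three lists are ordinary str.
english = [
    "Let it go, let it go!",
    "I am one with the wind and sky",
    "Let it go, let it go!",
    "You'll never see me cry!",
]
icelandic = [
    "Þetta er nóg, þetta er nóg",
    "Uppi í himni eins og vindablær",
    "Þetta er nóg, komið nóg",
    "Og tár mín enginn sér fær",
]
russian = [
    "Отпусти и забудь,",
    "Этот мир из твоих грёз.",
    "Отпусти и забудь,",
    "И не будет больше слёз.",
]

# zip() groups the first lines together, then the second lines, and so on;
# the nested comprehension flattens those groups back into one list.
interleaved = [line
               for group in zip(english, icelandic, russian)
               for line in group]

print("\n".join(interleaved))
```

No script needs special handling here: each line is the same `str` type whether it holds Latin, Icelandic or Cyrillic letters.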
Re: python 2.7 and unicode (one more time)
Marko Rauhamaa wrote:
> Michael Torrie :
>
>> Unicode can only be encoded to bytes.
>> Bytes can only be decoded to unicode.
>
> I don't really like it how Unicode is equated with text, or even
> character strings.

That surely depends on the context.

To be technically correct, Unicode is a character set together with a set of rules for dealing with them (e.g. rules for uppercasing characters, sorting rules, etc.). When referring to the standard, "Unicode" is a noun; when referring to text, it is actually an adjective being used as a noun. That is, "Unicode text" has become abbreviated as just "Unicode" in much the same way as "human beings" has become abbreviated as just "humans".

In that sense, "text is Unicode" just means "in the context in which we are talking, when I say 'text' I mean 'Unicode text' as opposed to (for example) 'ASCII text' or 'KOI-8 text'." It certainly doesn't mean that *all* text in other contexts is Unicode, since that is obviously untrue.

(E.g. there are millions of existing files across the world containing text which use legacy encodings that are not compatible with Unicode.)

> There's barely any difference between the truth value of these
> statements:
>
>    Python strings are ASCII.
>
>    Python strings are Latin-1.
>
>    Python strings are Unicode.
>
> Each of those statements is true as long as you stay within the
> respective character sets, and cease to be true when your text contains
> characters outside the character sets.

When we say "Python strings are FOO", we are making a statement about arbitrary Python strings, not a particular set of concrete examples of strings. If Python strings are FOO, that means that for all possible Python strings s, "s is FOO" is a true statement.

We cannot say that Python strings are uppercase, because we can easily find counter-examples such as 'xyz'. Likewise we cannot say Python strings are ASCII, or Latin-1, because we can easily find counter-examples such as 'Ř'.

On the other hand, Python strings *are* Unicode, because by design Python strings are limited to Unicode. Every Python string is a Unicode string.

> Now, it is true that Python currently limits itself to the 1,114,112
> Unicode code points. And it likely won't adopt more characters unless
> Unicode does it first. However, text is something more lofty and
> abstract than a sequence of Unicode code points.

You are certainly correct that in its full generality, "text" is much more than just a string of code points. Unicode strings are a primitive data type. A powerful and sophisticated text processing application may even find Python strings too primitive, possibly needing something like ropes of graphemes rather than strings of code points.

We Western and Northern European speakers -- and I don't know whether Finns are counted as Northern Europeans or Eastern Europeans -- are lucky in that our natural languages are well-covered by Unicode. All our graphemes are also code points, even the "funny ones with accents". As an English speaker, I have to remind myself that not every grapheme is a single code point, but Devanagari or Navajo writers will never make that mistake.

> We shouldn't call strings Unicode any more than we call numbers IEEE or
> times ISO.

We certainly shouldn't call numbers IEEE, but we might very well call them IEEE-754. Actually, since IEEE-754 covers multiple formats, we have to be more specific: Python floats are IEEE-754 double-precision binary floats.

--
Steven
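Both claims above, the 'Ř' counter-example and "not every grapheme is a single code point", can be checked in a few lines. This is a sketch assuming Python 3; the helper `fits` and the Devanagari sample are illustrations added here, not something from the thread.

```python
import unicodedata

def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

counter_example = "Ř"                      # U+0158, outside ASCII and Latin-1
in_ascii = fits(counter_example, "ascii")
in_latin1 = fits(counter_example, "latin-1")
in_utf8 = fits(counter_example, "utf-8")   # UTF-8 covers every code point here

# One visible letter, two possible code point sequences.
composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by COMBINING ACUTE ACCENT
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# A Devanagari syllable: consonant KA plus vowel sign I, always two code points.
ki = "\u0915\u093f"
ki_length = len(unicodedata.normalize("NFC", ki))

print(in_ascii, in_latin1, in_utf8)
print(len(composed), len(decomposed), same_after_nfc, ki_length)
```

`len()` counts code points, which is why the accented Latin letter can be made to report 1 after normalisation while the Devanagari syllable reports 2 either way.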
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa wrote: > Michael Torrie : > >> Unicode can only be encoded to bytes. >> Bytes can only be decoded to unicode. > > I don't really like it how Unicode is equated with text, or even > character strings. > > There's barely any difference between the truth value of these > statements: > >Python strings are ASCII. > >Python strings are Latin-1. > >Python strings are Unicode. > > Each of those statements is true as long as you stay within the > respective character sets, and cease to be true when your text contains > characters outside the character sets. The difference is that ASCII and Latin-1 cut out a large number of active world languages, UCS-2 (the intermediate option you didn't mention) cuts out a small proportion (by usage) of significant characters, and Unicode cuts out only those characters which fall under issues like Han unification. (Plus any that haven't yet been allocated. But since Python doesn't actually validate code points to ensure that they've been given meanings, you can use today's Python to work with tomorrow's Unicode.) Do you have actual text that you're unable to represent in Unicode? If so, you are going to have major problems using it with *any* computer system. There are Japanese encodings that can represent additional characters, but they also *cannot* represent a lot of the other characters we use, so there'll be fundamental incompatibilities. > Now, it is true that Python currently limits itself to the 1,114,112 > Unicode code points. And it likely won't adopt more characters unless > Unicode does it first. However, text is something more lofty and > abstract than a sequence of Unicode code points. > > We shouldn't call strings Unicode any more than we call numbers IEEE or > times ISO. We don't call numbers IEEE, but if we're working with Python floats, we *do* require all numbers to be representable as IEEE floating-point. Don't like that? 
Pick decimal.Decimal instead, or fractions.Fraction, and pick a different set of limitations... but ultimately, you *will* have restrictions - and much tighter restrictions than Unicode places on text. Do you genuinely have text that you can't represent in Unicode, or are you just arguing against Unicode to try to justify "Python strings are " as a basis for your code? ChrisA -- https://mail.python.org/mailman/listinfo/python-list
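To make the "pick a different set of limitations" remark concrete, here is a small sketch assuming Python 3 and the default `decimal` context (28 significant digits). The particular numbers are chosen here for illustration only.

```python
from decimal import Decimal
from fractions import Fraction

# IEEE-754 doubles cannot represent 0.1, 0.2 or 0.3 exactly.
float_exact = (0.1 + 0.2 == 0.3)

# Decimal handles decimal fractions exactly...
decimal_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

# ...but has its own limits: 1/3 is rounded to the context precision.
third = Decimal(1) / Decimal(3)
decimal_third_exact = (third * 3 == Decimal(1))

# Fraction is exact for any rational number, at the cost of speed and size.
fraction_exact = (Fraction(1, 3) * 3 == 1)

print(float_exact, decimal_exact, decimal_third_exact, fraction_exact)
```

Each type draws its boundary somewhere, which is the sense in which the restriction Unicode places on text is comparatively mild.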
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 4:42 AM, wrote:
> On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
>>
>> Why should it encode to bytes?
>
> Because a bytes format string suggests a bytes result. Why does unicode
> always "win", rather than the type of the format string always winning?

For the same reason that float always "wins":

    >>> 1.0 + 2
    3.0
    >>> 1 + 2.0
    3.0

>> Makes much better sense to work in
>> Unicode. But mainly, it has to do one of them, and be predictable.
>
> Yeah, but string % is not a symmetrical operator. People's mental model
> of it is likely to be that it acts like format (which does use the type
> of the format string) or C sprintf/wsprintf (both of which use the same
> type for the format string and result). And literally every other type
> is converted to the type of the format string when used with %s - having
> unicode be special adds cognitive load, and it means you can't safely
> blindly use %s with an unknown object.

True, but Python 2 deliberately lets you conflate the two, so you get a bit of convenience at the expense of complexity when things go wrong. Python 3, on the other hand, is much more careful about the difference:

    >>> "asdf %s qwer" % b"zxcv"
    "asdf b'zxcv' qwer"
    >>> b"asdf %s qwer" % "zxcv"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for %: 'bytes' and 'str'

So your complaint *has* been resolved... but only in Python 3, because the change would break stuff.

ChrisA
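The same two experiments can be written as a script rather than a REPL session. This is a sketch assuming Python 3. One thing it deliberately does not check is the wording of the error, because Python 3.5 and later support `%` on bytes and report a different `TypeError` message than the one quoted above.

```python
# Text format string, bytes argument: the result is text, and the bytes
# object is rendered through its repr rather than silently decoded.
text_result = "asdf %s qwer" % b"zxcv"

# Bytes format string, text argument: refused outright.
bytes_error = None
try:
    b"asdf %s qwer" % "zxcv"
except TypeError as exc:
    bytes_error = exc

# The numeric analogy from the message: float "wins" whichever side it is on.
mixed_types = (type(1.0 + 2), type(1 + 2.0))

print(repr(text_result))
print(type(bytes_error).__name__)
print(mixed_types)
```

Neither direction performs an implicit encode or decode, which is the resolution the message describes.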
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014, at 16:29, Ethan Furman wrote: > If your unicode string happens to contain a base64 encoded .png, then you > could decode that into bytes. ;) Bytes of the PNG, or of the raw pixels? -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Ethan Furman : > If your unicode string happens to contain a base64 encoded .png, then > you could decode that into bytes. ;) You could embed your PNG file in XML in binary form as CDATA. Then, your "characters" would represent 8- or 16-bit integers. You just need to replace all accidental occurrences of ]]> with ]]>]]>
Re: python 2.7 and unicode (one more time)
On 11/20/2014 07:53 AM, Chris Angelico wrote:
> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote:
>> I think that you may get a Unicode/Encode/Error when you try to /decode/ a
>> unicode string is more confusing...
>
> Hang on a minute, what does it even mean to decode a Unicode string?
> That's where the problem is. Fortunately that's one that Py3 solved -
> str simply doesn't have a decode() method.

If your unicode string happens to contain a base64 encoded .png, then you could decode that into bytes. ;)

--
~Ethan~
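For what it's worth, the base64 case looks like this in Python 3. This is a sketch only: instead of a real image it uses the eight-byte PNG file signature as stand-in data.

```python
import base64

# Stand-in for real image data: the 8-byte PNG file signature.
png_signature = b"\x89PNG\r\n\x1a\n"

# bytes -> base64 bytes -> ASCII text (what would sit in a str, JSON, XML...).
as_text = base64.b64encode(png_signature).decode("ascii")

# Text -> bytes again. This is a base64 decode, not a character decode:
# b64decode() accepts an ASCII-only str and returns the original bytes.
round_tripped = base64.b64decode(as_text)

print(type(as_text).__name__, type(round_tripped).__name__)
print(round_tripped == png_signature)
```

"Decode" here refers to the base64 layer, which is a different operation from turning bytes into characters, and the result is the bytes of the PNG file rather than raw pixels.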
Re: python 2.7 and unicode (one more time)
On 20/11/2014 18:06, Ian Kelly wrote: On Thu, Nov 20, 2014 at 10:42 AM, wrote: and it means you can't safely blindly use %s with an unknown object. You can't safely do this anyway. Whether it's %s with a str and a unicode, or %s with a unicode and a str, *something* is going to have to be implicitly encoded or decoded, and if ascii doesn't happen to be the correct encoding then the result will be either an error or a silent failure. All I know about this encoding/decoding malarky is that I'd prefer an error to a silent failure any day of the week. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Michael Torrie : > Unicode can only be encoded to bytes. > Bytes can only be decoded to unicode. I don't really like it how Unicode is equated with text, or even character strings. There's barely any difference between the truth value of these statements: Python strings are ASCII. Python strings are Latin-1. Python strings are Unicode. Each of those statements is true as long as you stay within the respective character sets, and cease to be true when your text contains characters outside the character sets. Now, it is true that Python currently limits itself to the 1,114,112 Unicode code points. And it likely won't adopt more characters unless Unicode does it first. However, text is something more lofty and abstract than a sequence of Unicode code points. We shouldn't call strings Unicode any more than we call numbers IEEE or times ISO. Marko -- https://mail.python.org/mailman/listinfo/python-list
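The 1,114,112 figure is easy to confirm from the interpreter. This is a sketch assuming Python 3.3 or later, where `sys.maxunicode` is always 0x10FFFF.

```python
import sys

# Code points run from U+0000 to U+10FFFF inclusive.
code_point_count = sys.maxunicode + 1

# The highest code point is accepted as a one-character string...
highest = chr(sys.maxunicode)

# ...and anything beyond it is rejected.
beyond_range_rejected = False
try:
    chr(sys.maxunicode + 1)
except ValueError:
    beyond_range_rejected = True

print(code_point_count, len(highest), beyond_range_rejected)
```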
Re: python 2.7 and unicode (one more time)
random...@fastmail.us wrote:
> On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
>> On Fri, Nov 21, 2014 at 12:59 AM, wrote:
>> > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
>> >> >>> "%s nötig %s" % (u"üblich", u"ähnlich")
>> >> Traceback (most recent call last):
>> >>   File "<stdin>", line 1, in <module>
>> >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
>> >> 4: ordinal not in range(128)
>> >
>> > This is surprising to me - why is it trying to decode the format
>> > string, rather than encode the arguments?
>>
>> Why should it encode to bytes?
>
> Because a bytes format string suggests a bytes result. Why does unicode
> always "win", rather than the type of the format string always winning?

My guess is that when unicode was introduced the decision to propagate str to unicode in some cases was made because the developers expected that more old code that was unaware of unicode would continue to work.

The old methods __mod__(), replace(), and join() that conceptually deal with strings propagate while those that deal with characters -- center(), r/ljust(), translate() -- don't. The newer format() method doesn't propagate, which is probably due to a change in attitude rather than an oversight.
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014 at 11:06 AM, Ian Kelly wrote: > On Thu, Nov 20, 2014 at 10:42 AM, wrote: >> and it means you can't safely >> blindly use %s with an unknown object. > > You can't safely do this anyway. Whether it's %s with a str and a > unicode, or %s with a unicode and a str, *something* is going to have > to be implicitly encoded or decoded, and if ascii doesn't happen to be > the correct encoding then the result will be either an error or a > silent failure. Also note that if you use %r instead of %s, you'll get the result you want (although the unicode string will be quoted rather than encoded). -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014 at 10:42 AM, wrote: > and it means you can't safely > blindly use %s with an unknown object. You can't safely do this anyway. Whether it's %s with a str and a unicode, or %s with a unicode and a str, *something* is going to have to be implicitly encoded or decoded, and if ascii doesn't happen to be the correct encoding then the result will be either an error or a silent failure. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote: > On Fri, Nov 21, 2014 at 12:59 AM, wrote: > > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote: > >> >>> "%s nötig %s" % (u"üblich", u"ähnlich") > >> Traceback (most recent call last): > >> File "", line 1, in > >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: > >> ordinal not in range(128) > > > > This is surprising to me - why is it trying to decode the format string, > > rather than encode the arguments? > > Why should it encode to bytes? Because a bytes format string suggests a bytes result. Why does unicode always "win", rather than the type of the format string always winning? > Makes much better sense to work in > Unicode. But mainly, it has to do one of them, and be predictable. Yeah, but string % is not a symmetrical operator. People's mental model of it is likely to be that it acts like format (which does use the type of the format string) or C sprintf/wsprintf (both of which use the same type for the format string and result). And literally every other type is converted to the type of the format string when used with %s - having unicode be special adds cognitive load, and it means you can't safely blindly use %s with an unknown object. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
Chris Angelico wrote: > On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten <__pete...@web.de> wrote: >> Chris Angelico wrote: >> >>> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote: I think that you may get a Unicode/Encode/Error when you try to /decode/ a unicode string is more confusing... >>> >>> Hang on a minute, what does it even mean to decode a Unicode string? >> >> Let's not get philosophical ;) > > No, I'm quite serious. I'm sorry I'm limited to text, otherwise I would have formatted the ";)" as 30pt blinking magenta... > You encode Unicode text into bytes; you decode > bytes into text. You can also encode a floating-point value into > bytes, and decode bytes into a float. Or you could encode a large and > complex structure into bytes, using something like pickle or json, and > then decode those bytes later on. The pattern is always the same: the > abstract object with meaning to a human is encoded into a concrete > form that a computer can handle, and the concrete is decoded into the > abstract. If you're not good at sight-reading sheet music, you'll have > the same feeling of staring at the dots, decoding them one by one into > this abstract thing called "music", and then being able to work with > it. > > When you try to decode a Unicode string, what happens is that Python 2 > says "Oh, you're trying to do a byte-string operation on a Unicode > string... I'll quickly encode that to bytes for you, then do what you > asked". That's why you can get an *en*coding error when you asked to > *de*code - because both operations have to happen. In an alternative universe unicode.decode() could have been implemented as a no-op. As you put it it looks like you have to find the true nature of the problem and then cast it into code -- a kind of essentialism. I would rather emphasise the process; the evolving interface changes your view on the underlying problem -- a hermeneutic cycle if you will. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On 11/20/2014 09:32 AM, Peter Otten wrote: > Chris Angelico wrote: > >> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote: >>> I think that you may get a Unicode/Encode/Error when you try to /decode/ >>> a unicode string is more confusing... >> >> Hang on a minute, what does it even mean to decode a Unicode string? > > Let's not get philosophical ;) It's not philosophical. It's an important distinction that folks need to be clear on when understanding unicode and the errors that python can throw. Unicode can only be encoded to bytes. Bytes can only be decoded to unicode. Without understanding that, the exception errors about decoding won't be properly understood, nor will one know how to fix them. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten <__pete...@web.de> wrote: > Chris Angelico wrote: > >> On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote: >>> I think that you may get a Unicode/Encode/Error when you try to /decode/ >>> a unicode string is more confusing... >> >> Hang on a minute, what does it even mean to decode a Unicode string? > > Let's not get philosophical ;) No, I'm quite serious. You encode Unicode text into bytes; you decode bytes into text. You can also encode a floating-point value into bytes, and decode bytes into a float. Or you could encode a large and complex structure into bytes, using something like pickle or json, and then decode those bytes later on. The pattern is always the same: the abstract object with meaning to a human is encoded into a concrete form that a computer can handle, and the concrete is decoded into the abstract. If you're not good at sight-reading sheet music, you'll have the same feeling of staring at the dots, decoding them one by one into this abstract thing called "music", and then being able to work with it. When you try to decode a Unicode string, what happens is that Python 2 says "Oh, you're trying to do a byte-string operation on a Unicode string... I'll quickly encode that to bytes for you, then do what you asked". That's why you can get an *en*coding error when you asked to *de*code - because both operations have to happen. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
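The encode/decode pattern, and the hidden extra step that produces an *en*coding error on a *de*code, can be sketched in Python 3. The function `py2_style_decode` is a hypothetical stand-in written for this example: it imitates what Python 2's `unicode.decode()` is described as doing above, assuming the implicit codec is ASCII, and is not real library code.

```python
import json
import struct

# Abstract -> concrete -> abstract, three times over.
word = "nötig"
word_bytes = word.encode("utf-8")
word_back = word_bytes.decode("utf-8")

number = 1.5
number_bytes = struct.pack("<d", number)          # 8-byte IEEE-754 double
(number_back,) = struct.unpack("<d", number_bytes)

record = {"word": "üblich", "count": 2}
record_bytes = json.dumps(record).encode("utf-8")
record_back = json.loads(record_bytes.decode("utf-8"))

def py2_style_decode(text, encoding):
    """Hypothetical imitation of Python 2: encode to ASCII first, then decode."""
    return text.encode("ascii").decode(encoding)

# Asking to decode text fails in the encode step that was never asked for.
hidden_step_error = None
try:
    py2_style_decode("ähnlich", "utf-8")
except UnicodeEncodeError as exc:
    hidden_step_error = exc

print(word_back, number_back, record_back)
print(type(hidden_step_error).__name__)
```

With pure-ASCII text the imitation succeeds quietly, which would explain why the problem tends to surface only once real data arrives.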
Re: python 2.7 and unicode (one more time)
Chris Angelico wrote: > On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote: >> I think that you may get a Unicode/Encode/Error when you try to /decode/ >> a unicode string is more confusing... > > Hang on a minute, what does it even mean to decode a Unicode string? Let's not get philosophical ;) > That's where the problem is. Fortunately that's one that Py3 solved - > str simply doesn't have a decode() method. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten <__pete...@web.de> wrote: > I think that you may get a Unicode/Encode/Error when you try to /decode/ a > unicode string is more confusing... Hang on a minute, what does it even mean to decode a Unicode string? That's where the problem is. Fortunately that's one that Py3 solved - str simply doesn't have a decode() method. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
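A quick way to see the Python 3 side of this, as a sketch assuming any Python 3 interpreter:

```python
# In Python 3 each type offers only the direction that makes sense for it.
str_has_decode = hasattr(str, "decode")      # text cannot be decoded
bytes_has_encode = hasattr(bytes, "encode")  # bytes cannot be encoded

encoded = "ähnlich".encode("utf-8")          # str -> bytes
decoded = encoded.decode("utf-8")            # bytes -> str

print(str_has_decode, bytes_has_encode)
print(type(encoded).__name__, type(decoded).__name__)
```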
Re: python 2.7 and unicode (one more time)
random...@fastmail.us wrote: > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote: >> >>> "%s nötig %s" % (u"üblich", u"ähnlich") >> Traceback (most recent call last): >> File "", line 1, in >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: >> ordinal not in range(128) > > This is surprising to me - why is it trying to decode the format string, > rather than encode the arguments? Probably to make it easier to mix byte and unicode strings. In hindsight it may not have been a good idea, but it had the potential to save some memory. I think that you may get a Unicode/Encode/Error when you try to /decode/ a unicode string is more confusing... -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau wrote: > Hi, > > Thanks for the "from __future__ import unicode_literals" trick, it makes > that switch much less intrusive. > > However it seems that I will suddenly be trapped by all modules which > are not prepared to handle unicode. For example: > > >>> from __future__ import unicode_literals > >>> import locale > >>> locale.setlocale(locale.LC_ALL, 'fr_FR') > Traceback (most recent call last): >File "", line 1, in >File "/usr/lib64/python2.7/locale.py", line 546, in setlocale > locale = normalize(_build_localename(locale)) >File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename > language, encoding = localetuple > ValueError: too many values to unpack > > Is the locale module an exception and in that case I'll fix it by doing: > > >>> locale.setlocale(locale.LC_ALL, b'fr_FR') > > or is a (big) part of the modules in python 2.7 still not ready for > unicode and in that case I have to decide which prefix (u or b) I should > manually add ? Sadly, there are quite a lot of parts of Python 2 that simply don't handle Unicode strings. But you can probably keep all of those down to just a handful of explicit b"whatever" strings; most places should accept unicode as well as str. What you're seeing here is a prime example of one of this author's points (caution, long post): http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/ """The lesson of Python 3 is: give programmers a Unicode string type, *make it the default*, and encoding issues will /mostly/ go away.""" There's a whole ecosystem to Python 2 - some in the standard library, heaps more in the rest of the world - and a lot of it was written on the assumption that a byte is a character is an octet. When you pass Unicode strings to functions written to expect byte strings, sometimes you win, and sometimes you lose... even with the standard library itself. 
But the Python 3 ecosystem has been written on the assumption that strings are Unicode. It's only a narrow set of programs ("boundary code", where you're moving text across networks and stuff like that) where the Python 2 model is easier to work with; and the recent Py3 releases have been progressively working to relieve that pain. The absolute worst case is a function which exists in Python 2 and 3, and requires a byte string in Py2 and a text string in Py3. Sadly, that may be exactly what locale.setlocale() is. For that, I would suggest explicitly passing stuff through str(): locale.setlocale(locale.LC_ALL, str('fr_FR')) In Python 3, 'fr_FR' is already a str, so passing it through str() will have no significant effect. (Though it would be worth commenting that, to make it clear to a subsequent reader that this is Py2 compat code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode, so passing it through str() will encode it to ASCII, producing a byte string that setlocale should be happy with. By the way, the reason for the strange error message is clearer in Python 3, which chains in another exception: >>> locale.setlocale(locale.LC_ALL, b'fr_FR') Traceback (most recent call last): File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename language, encoding = localetuple ValueError: too many values to unpack (expected 2) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale locale = normalize(_build_localename(locale)) File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename raise TypeError('Locale must be None, a string, or an iterable of two strings -- language code, encoding.') TypeError: Locale must be None, a string, or an iterable of two strings -- language code, encoding. 
So when it gets the wrong type of string, it attempts to unpack it as an iterable; it yields five values (the five bytes or characters, depending on which way it's the wrong type of string), but it's expecting two. Fortunately, str() will deal with this. But make sure you don't have the b prefix, or str() in Py3 will give you quite a different result! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
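The odd "too many values to unpack" message is reproducible without touching the locale module. In this sketch `build_localename` is a made-up stand-in for the two-value unpacking described above, not the real `locale` internals, and the code is assumed to run on Python 3.

```python
def build_localename(localetuple):
    """Hypothetical stand-in: expects a (language, encoding) pair."""
    language, encoding = localetuple
    return language + "." + encoding

# A string is iterable too, so 'fr_FR' is unpacked as five characters.
unpack_message = None
try:
    build_localename("fr_FR")
except ValueError as exc:
    unpack_message = str(exc)

# The str() wrapper is harmless on a Python 3 text literal...
plain = str("fr_FR")

# ...but with a b prefix, str() returns the repr of the bytes object.
with_b_prefix = str(b"fr_FR")

print(unpack_message)
print(plain, with_b_prefix)
```

The last value is the "quite a different result" warned about above: the quotes and the b become part of the string.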
Re: python 2.7 and unicode (one more time)
On Fri, Nov 21, 2014 at 12:59 AM, wrote: > On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote: >> >>> "%s nötig %s" % (u"üblich", u"ähnlich") >> Traceback (most recent call last): >> File "", line 1, in >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: >> ordinal not in range(128) > > This is surprising to me - why is it trying to decode the format string, > rather than encode the arguments? Why should it encode to bytes? Makes much better sense to work in Unicode. But mainly, it has to do one of them, and be predictable. If you add a float and an int, you have to predictably get back one of those two types, and since neither is a perfect superset of the other, one just has to be picked. (And that's float, because it's more likely to be the better option.) In this case, picking Unicode to meet on is easily the better option, because you'll often have pure-ASCII string literals as format strings, and Unicode data being interpolated into it. So using an ASCII codec is far more likely to succeed if you decode the format string than if you encode the data. Personally, I'd much rather be very clear about what's text and what's bytes, and not have any automatic encoding at all. That's why I use Python 3. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
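The "decode the format string rather than encode the data" argument can be spelled out with explicit conversions. This is a Python 3 sketch in which the implicit Python 2 steps are written by hand; the sample values are invented for illustration.

```python
# A typical pairing: a pure-ASCII format string and non-ASCII text data.
format_bytes = b"%s: %s"
data_text = "üblich"

# Option 1: decode the format string with the ASCII codec. This works.
format_text = format_bytes.decode("ascii")
result = format_text % ("header", data_text)

# Option 2: encode the data with the ASCII codec. This fails.
encode_failed = False
try:
    data_text.encode("ascii")
except UnicodeEncodeError:
    encode_failed = True

print(result)
print(encode_failed)
```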
Re: python 2.7 and unicode (one more time)
Hi, On 11/20/2014 11:47 AM, Chris Angelico wrote: > On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau > wrote: >> My question is: how should this be fixed properly ? >> >> A simple solution would be to force all strings passed to the >> logger to be unicode: >> >> log.debug(u"%s: %s" % ...) >> >> and more generally force all string in my code to be unicode by >> using the 'u' prefix. > > Yep. And then you may want to consider "from __future__ import > unicode_literals", which will make string literals represent Unicode > strings rather than byte strings. Basically the same as you're saying, > only without the explicit u prefixes. Thanks for the "from __future__ import unicode_literals" trick, it makes that switch much less intrusive. However it seems that I will suddenly be trapped by all modules which are not prepared to handle unicode. For example: >>> from __future__ import unicode_literals >>> import locale >>> locale.setlocale(locale.LC_ALL, 'fr_FR') Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python2.7/locale.py", line 546, in setlocale locale = normalize(_build_localename(locale)) File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename language, encoding = localetuple ValueError: too many values to unpack Is the locale module an exception and in that case I'll fix it by doing: >>> locale.setlocale(locale.LC_ALL, b'fr_FR') or is a (big) part of the modules in python 2.7 still not ready for unicode and in that case I have to decide which prefix (u or b) I should manually add ? Thanks. -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote: > >>> "%s nötig %s" % (u"üblich", u"ähnlich") > Traceback (most recent call last): > File "", line 1, in > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: > ordinal not in range(128) This is surprising to me - why is it trying to decode the format string, rather than encode the arguments? -- https://mail.python.org/mailman/listinfo/python-list
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014 at 11:35 PM, Peter Otten <__pete...@web.de> wrote:
> You don't need to change an all-ascii bytestring to unicode.
> Lo and behold:
>
> >>> "%s %s" % (u"üblich", u"ähnlich")
> u'\xfcblich \xe4hnlich'
> >>> u"%s %s" % (u"üblich", u"ähnlich")
> u'\xfcblich \xe4hnlich'
>
> Only non-ascii bytestrings mean trouble, either noisy

It's better to not depend on that, though. Be clear and explicit about the difference between bytes and text, and don't try to pretend they're the same thing, even for ASCII.

ChrisA
Re: python 2.7 and unicode (one more time)
Francis Moreau wrote:
> Hello,
>
> My application is using gettext module to do the translation
> stuff. Translated messages are unicode on both python 2 and
> 3 (with python2.7 I had to explicitely asked for unicode).
>
> A problem arises when formatting those messages before logging
> them. For example:
>
>   log.debug("%s: %s" % (header, _("will return an unicode string")))

This is only problematic if header is a non-ascii bytestring.

> Indeed on python2.7, "%s: %s" is 'str' whereas _() returns
> unicode.
>
> My question is: how should this be fixed properly ?
>
> A simple solution would be to force all strings passed to the
> logger to be unicode:
>
>   log.debug(u"%s: %s" % ...)
>
> and more generally force all string in my code to be unicode by
> using the 'u' prefix.
>
> or is there a proper solution ?

You don't need to change an all-ascii bytestring to unicode. Lo and behold:

    >>> "%s %s" % (u"üblich", u"ähnlich")
    u'\xfcblich \xe4hnlich'
    >>> u"%s %s" % (u"üblich", u"ähnlich")
    u'\xfcblich \xe4hnlich'

Only non-ascii bytestrings mean trouble, either noisy

    >>> u"%s nötig %s" % (u"üblich", "ähnlich")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
    >>> "%s nötig %s" % (u"üblich", u"ähnlich")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

or silently until you have to decipher the logfile contents. It's best to stay away from them, and the "from __future__ import unicode_literals" that Chris mentioned is a convenient way to achieve that.
Re: python 2.7 and unicode (one more time)
On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau wrote: > My question is: how should this be fixed properly ? > > A simple solution would be to force all strings passed to the > logger to be unicode: > > log.debug(u"%s: %s" % ...) > > and more generally force all string in my code to be unicode by > using the 'u' prefix. Yep. And then you may want to consider "from __future__ import unicode_literals", which will make string literals represent Unicode strings rather than byte strings. Basically the same as you're saying, only without the explicit u prefixes. This will also make your Py2 code behave more like the way your Py3 code does (as bare string literals are always Unicode strings in Py3). ChrisA -- https://mail.python.org/mailman/listinfo/python-list
python 2.7 and unicode (one more time)
Hello,

My application is using the gettext module to do the translation stuff. Translated messages are unicode on both python 2 and 3 (with python2.7 I had to explicitly ask for unicode).

A problem arises when formatting those messages before logging them. For example:

  log.debug("%s: %s" % (header, _("will return an unicode string")))

Indeed on python2.7, "%s: %s" is 'str' whereas _() returns unicode.

My question is: how should this be fixed properly ?

A simple solution would be to force all strings passed to the logger to be unicode:

  log.debug(u"%s: %s" % ...)

and more generally force all strings in my code to be unicode by using the 'u' prefix.

or is there a proper solution ?

Thanks.