Re: Unicode chr(150) en dash
> 150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
> superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
> control codes \x80 to \x9F.

To disambiguate the two, when I want to refer to the one with the control
characters, I use the name "IANA ISO-8859-1" or "the IANA version of
Latin-1", or some such, to reflect the fact that it's not the ISO standard,
but the (unfortunately differing) IANA registration thereof.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
Re: Unicode chr(150) en dash
On Apr 18, 8:36 am, John Machin <[EMAIL PROTECTED]> wrote:
> hdante wrote:
>
> > The character code in question (which is present in the page), 150,
> > doesn't exist in ISO-8859-1.
>
> Are you sure? Consider (re-)reading all of the Wikipedia article.
>
> 150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
> superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
> control codes \x80 to \x9F.
>
> > See
> >
> > http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
> > blank)
>
> You must have been looking at the table of the "lite" ISO 8859-1 (one
> hyphen). Reading further you will see \x96 described as SPA or "Start of
> Guarded Area". Then there is the ISO-8859-1 (two hyphens) table,
> including \x96.
>
> HTH,
> John

Sorry, that's right, I should have been referring to the second table.
Re: Unicode chr(150) en dash
hdante wrote:
>
> The character code in question (which is present in the page), 150,
> doesn't exist in ISO-8859-1.

Are you sure? Consider (re-)reading all of the Wikipedia article.

150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
control codes \x80 to \x9F.

> See
>
> http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
> blank)

You must have been looking at the table of the "lite" ISO 8859-1 (one
hyphen). Reading further you will see \x96 described as SPA or "Start of
Guarded Area". Then there is the ISO-8859-1 (two hyphens) table,
including \x96.

HTH,
John
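A minimal sketch of the one-hyphen / two-hyphen distinction described in this message, written in Python 3 syntax (an assumption; the thread itself uses Python 2.4). Python's "latin-1" codec appears to follow the IANA flavour, so bytes \x80 to \x9F decode to control characters, while "cp1252" assigns a printable character to byte 150. The variable names are illustrative only.

```python
import unicodedata

# Python's latin-1 codec maps each byte value to the code point of the same number.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("latin-1")
same_numbers = [ord(ch) for ch in latin1_text] == list(range(256))

# Bytes 0x80-0x9F come out as control characters (Unicode category "Cc").
c1_categories = {unicodedata.category(ch) for ch in latin1_text[0x80:0xA0]}

# Byte 150 (0x96): a control code under latin-1, EN DASH under cp1252.
as_latin1 = b"\x96".decode("latin-1")
as_cp1252 = b"\x96".decode("cp1252")

print(same_numbers, c1_categories, repr(as_latin1), unicodedata.name(as_cp1252))
```

Decoding with "latin-1" therefore never fails, which is convenient but also means it cannot tell you that the data was really Windows-1252.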
Re: Unicode chr(150) en dash
On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <[EMAIL PROTECTED]> wrote:
> >
> > > Don't use old 8-bit encodings. Use UTF-8.
> >
> > Yes, I'll try. But it is a problem when I only want to read; it's not
> > that I'm trying to write or create the content.
> > I suppose Microsoft's commercial success is to blame. They won't adhere
> > to standards if that doesn't make sense for their business.
> >
> > I'll change the approach, trying to filter the contents with htmllib and
> > mapping those troubling characters on my own.
> > Anyway, this has been a very instructive dive into unicode for me; I've
> > got things cleared up now.
> >
> > Thanks to everyone for the great help.
>
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.
>
> Something like:
>
>     for c in control_chars:
>         if c in encoded_text:
>             unicode_text = encoded_text.decode('cp1252')
>             break
>     else:
>         unicode_text = encoded_text.decode('latin-1')
>
> Note that the else matches the for, not the if.
>
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.

One warning: This works if you know all your documents are in one of those
two encodings, but you could break other encodings, like UTF-8, this way.
Fortunately UTF-8 is a pretty fragile encoding, so it's easy to break. You
can usually test if a document is decent UTF-8 just by wrapping it in a
try/except block:

    try:
        unicode_text = encoded_text.decode('utf-8')
    except UnicodeDecodeError:
        # not valid UTF-8: do the stuff above

None of these are perfect methods, but then again, if text encoding
detection were a perfect science, python could just handle it on its own.

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff
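The two heuristics in this message (try UTF-8 first, then look for bytes in the \x80 to \x9F range) can be combined roughly as follows. This is only a sketch in Python 3 syntax, not Cliff's code: the function name `guess_decode` and the constant `C1_RANGE` are made up for illustration, and it assumes the input can only be UTF-8, cp1252 or latin-1.

```python
# Byte values 0x80-0x9F: control codes in IANA ISO-8859-1, mostly printable in cp1252.
C1_RANGE = frozenset(range(0x80, 0xA0))


def guess_decode(encoded_text: bytes) -> str:
    """Decode bytes by trying UTF-8, then cp1252, then latin-1 (a heuristic only)."""
    try:
        # Strict UTF-8 decoding rejects most text written in an 8-bit encoding.
        return encoded_text.decode("utf-8")
    except UnicodeDecodeError:
        pass
    if any(byte in C1_RANGE for byte in encoded_text):
        try:
            return encoded_text.decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values (such as 0x81) are undefined in Python's cp1252 codec.
            pass
    # latin-1 accepts every byte value, so this last step cannot fail.
    return encoded_text.decode("latin-1")


print(repr(guess_decode(b"1990\x962008")))
```

Because iterating over a bytes object yields integers in Python 3, the membership test replaces the `for ... else` loop of the original suggestion; the decision it makes is the same.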
Re: Unicode chr(150) en dash
On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <[EMAIL PROTECTED]> wrote:
>
> > Don't use old 8-bit encodings. Use UTF-8.
>
> Yes, I'll try. But it is a problem when I only want to read; it's not
> that I'm trying to write or create the content.
> I suppose Microsoft's commercial success is to blame. They won't adhere
> to standards if that doesn't make sense for their business.
>
> I'll change the approach, trying to filter the contents with htmllib and
> mapping those troubling characters on my own.
> Anyway, this has been a very instructive dive into unicode for me; I've
> got things cleared up now.
>
> Thanks to everyone for the great help.

There are a number of code points (150 being one of them) that are used in
cp1252, which are reserved for control characters in ISO-8859-1. Those
characters will pretty much never be used in ISO-8859-1 documents. If
you're expecting documents of both types coming in, test for the presence
of those characters, and assume cp1252 for those documents.

Something like:

    for c in control_chars:
        if c in encoded_text:
            unicode_text = encoded_text.decode('cp1252')
            break
    else:
        unicode_text = encoded_text.decode('latin-1')

Note that the else matches the for, not the if.

You can figure out the characters to match on by looking at the wikipedia
pages for the encodings.

Cheers,
Cliff
Re: Unicode chr(150) en dash
On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
hdante <[EMAIL PROTECTED]> wrote:

> Don't use old 8-bit encodings. Use UTF-8.

Yes, I'll try. But it is a problem when I only want to read; it's not that
I'm trying to write or create the content.
I suppose Microsoft's commercial success is to blame. They won't adhere to
standards if that doesn't make sense for their business.

I'll change the approach, trying to filter the contents with htmllib and
mapping those troubling characters on my own.
Anyway, this has been a very instructive dive into unicode for me; I've got
things cleared up now.

Thanks to everyone for the great help.
Re: Unicode chr(150) en dash
On Apr 17, 12:10 pm, [EMAIL PROTECTED] wrote:
> Thank you Martin and John, for your excellent explanations.
>
> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why
> I did go for that encoding. But if the browser can

There's a trick here. Blame lax web standards and companies that don't
like standards. There's no EN DASH in ISO-8859-1. The first 256 characters
in Unicode are the same as ISO-8859-1, but the EN DASH code point is
U+2013. The character code in question (which is present in the page), 150,
doesn't exist in ISO-8859-1. See

http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is blank)

The character 150 exists in Windows-1252, however, which is a non-standard
clone of ISO-8859-1.

http://en.wikipedia.org/wiki/Windows-1252

Who is wrong?

- The guy who wrote the web site
- The browser that does the trick.
- Everybody for using a non-standard encoding
- Everybody for using an outdated 8-bit encoding.

Don't use old 8-bit encodings. Use UTF-8.

> I might need to go for python's htmllib to avoid this, not sure. But if I
> don't, if I only want to just copy and paste some web pages text contents
> into a tkinter Text widget, what should I do to successfully make every
> single character go all the way from the widget and out of tkinter into a
> python string variable? How did my browser know it should render an EN
> DASH instead of a circumflexed lowercase u?
>
> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-...
>
> Thanks a lot.
Re: Unicode chr(150) en dash
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Please do trust us that ISO-8859-1 does *NOT* support EN DASH.

There are two possible explanations for the behavior you observed:

a) even though the file was declared ISO-8859-1, the data in it
   actually didn't use that encoding. The browser somehow found out,
   and chose a different encoding from the declared one.
b) the web page contained the character reference &#8211; (or &#x2013;),
   or the entity reference &ndash;. XML can represent arbitrary Unicode
   characters this way, even in a file that is encoded with ASCII.

> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web pages text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.

> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Ok, this says &#150; in several places, as well as &#147; and &#148;

HTH,
Martin
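A short sketch of how character and entity references for EN DASH resolve, using `html.unescape` from the Python 3 standard library (an assumption; it is not available in the Python 2.4 used in this thread). It also suggests why browsers show a dash for a numeric reference in the 128 to 159 range: HTML5-style parsing, which `html.unescape` follows, treats such references as Windows-1252 bytes. The sample strings are illustrative, not taken from the page.

```python
import html

# Decimal reference, hexadecimal reference, named entity: all mean U+2013.
proper_refs = ["&#8211;", "&#x2013;", "&ndash;"]
proper_decoded = [html.unescape(ref) for ref in proper_refs]

# References in the 128-159 range are not valid characters, but HTML5 rules
# remap them through Windows-1252 instead of producing a control code.
legacy_refs = ["&#150;", "&#147;", "&#148;"]
legacy_decoded = [html.unescape(ref) for ref in legacy_refs]

for ref, ch in zip(proper_refs + legacy_refs, proper_decoded + legacy_decoded):
    print(ref, "->", "U+%04X" % ord(ch))
```

A strict XML parser would not apply the Windows-1252 remapping, so the same page could plausibly yield control characters there.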
Re: Unicode chr(150) en dash
<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]
> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Browsers tend to guess what the author intended a lot. In particular, they
fudge the difference between ISO-8859-1 and Windows-1252.

http://en.wikipedia.org/wiki/Windows-1252
Re: Unicode chr(150) en dash
On Apr 17, 10:10 am, [EMAIL PROTECTED] wrote:
> Thank you Martin and John, for your excellent explanations.
>
> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why
> I did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?
>
> I might need to go for python's htmllib to avoid this, not sure. But if I
> don't, if I only want to just copy and paste some web pages text contents
> into a tkinter Text widget, what should I do to successfully make every
> single character go all the way from the widget and out of tkinter into a
> python string variable? How did my browser know it should render an EN
> DASH instead of a circumflexed lowercase u?
>
> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-...
>
> Thanks a lot.

Just write in English. Like this, see? No encodings mess.
Re: Unicode chr(150) en dash
Thank you Martin and John, for your excellent explanations.

I think I understand the unicode basic principles, what confuses me is the
usage different applications make out of it.

For example, I got that EN DASH out of a web page which states
<?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
did go for that encoding. But if the browser can properly decode that
character using that encoding, how come other applications can't?

I might need to go for python's htmllib to avoid this, not sure. But if I
don't, if I only want to just copy and paste some web pages text contents
into a tkinter Text widget, what should I do to successfully make every
single character go all the way from the widget and out of tkinter into a
python string variable? How did my browser know it should render an EN DASH
instead of a circumflexed lowercase u?

This is the webpage in case you are interested, 4th line of first
paragraph, there is the EN DASH:
http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Thanks a lot.

On Wed, 16 Apr 2008 10:27:26 -0700
John Nagle <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > Hello guys & girls
> >
> > I'm pasting an "en dash"
> > (http://www.fileformat.info/info/unicode/char/2013/index.htm) character
> > into a tkinter widget, expecting it to be properly stored into a MySQL
> > database.
> >
> > I'm getting this error:
> > *
> > Exception in Tkinter callback Traceback (most recent call last): File
> > "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__ return
> > self.func(*args) File "chupadato.py", line 25, in guardar cursor.execute(a)
> > File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in
> > execute
> > query = query.encode(charset) UnicodeEncodeError: 'latin-1' codec can't
> > encode character u'\u2013' in position 52: ordinal not in range(256)
> > *
>
> Python and MySQL will do end to end Unicode quite well. But that's
> not what you're doing. How did "latin-1" get involved?
>
> If you want to use MySQL in Unicode, there are several things to be done.
> First, the connection has to be opened in Unicode:
>
>     db = MySQLdb.connect(host="localhost",
>         use_unicode=True, charset="utf8",
>         user=username, passwd=password, db=database)
>
> Yes, you have to specify both "use_unicode=True", which tells the client
> to talk Unicode, and set "charset" to "utf8", which tells the server
> to talk Unicode encoded as UTF-8.
>
> Then the tables need to be in Unicode. In SQL,
>
>     ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;
>
> before creating the tables. You can also change the types of
> existing tables and even individual fields to utf8, if necessary.
> (This takes time for big tables; the table is copied. But it works.)
>
> It's possible to get MySQL to store character sets other than
> ASCII or Unicode; you can store data in "latin1" if you want. This
> might make sense if, for example, all your data is in French or German,
> which maps well to "latin1". Unless that's your situation, go with
> either all-ASCII or all-Unicode. It's less confusing.
>
> John Nagle
Re: Unicode chr(150) en dash
[EMAIL PROTECTED] wrote:
> Hello guys & girls
>
> I'm pasting an "en dash"
> (http://www.fileformat.info/info/unicode/char/2013/index.htm) character
> into a tkinter widget, expecting it to be properly stored into a MySQL
> database.
>
> I'm getting this error:
> *
> Exception in Tkinter callback Traceback (most recent call last): File
> "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__ return
> self.func(*args) File "chupadato.py", line 25, in guardar cursor.execute(a)
> File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
> query = query.encode(charset) UnicodeEncodeError: 'latin-1' codec can't
> encode character u'\u2013' in position 52: ordinal not in range(256)
> *

Python and MySQL will do end to end Unicode quite well. But that's
not what you're doing. How did "latin-1" get involved?

If you want to use MySQL in Unicode, there are several things to be done.
First, the connection has to be opened in Unicode:

    db = MySQLdb.connect(host="localhost",
        use_unicode=True, charset="utf8",
        user=username, passwd=password, db=database)

Yes, you have to specify both "use_unicode=True", which tells the client
to talk Unicode, and set "charset" to "utf8", which tells the server
to talk Unicode encoded as UTF-8.

Then the tables need to be in Unicode. In SQL,

    ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;

before creating the tables. You can also change the types of
existing tables and even individual fields to utf8, if necessary.
(This takes time for big tables; the table is copied. But it works.)

It's possible to get MySQL to store character sets other than
ASCII or Unicode; you can store data in "latin1" if you want. This
might make sense if, for example, all your data is in French or German,
which maps well to "latin1". Unless that's your situation, go with
either all-ASCII or all-Unicode. It's less confusing.

John Nagle
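The failing step in the quoted traceback is `query.encode(charset)`, and it can be reproduced without a database. The sketch below uses Python 3 syntax (an assumption; the thread uses Python 2.4). The helper `try_encode` and the table and column names in the statement are hypothetical, made up for illustration, and no MySQLdb call is involved.

```python
def try_encode(query, charset):
    """Return the encoded bytes, or None if the charset cannot represent the text."""
    try:
        return query.encode(charset)
    except UnicodeEncodeError:
        return None


# Hypothetical statement containing an EN DASH (U+2013).
query = "INSERT INTO notes (body) VALUES ('1990\u20132008')"

latin1_bytes = try_encode(query, "latin-1")  # latin-1 has no EN DASH
utf8_bytes = try_encode(query, "utf8")       # UTF-8 can encode any code point

print(latin1_bytes is None, utf8_bytes is not None)
```

This is consistent with the advice above: once the connection charset is UTF-8, the encode step has nothing it cannot represent.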
Re: Unicode chr(150) en dash
> "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in
> execute query = query.encode(charset) UnicodeEncodeError: 'latin-1'
> codec can't encode character u'\u2013' in position 52: ordinal not in
> range(256)

Here it complains that it deals with the character U+2013, which is
"EN DASH"; it complains that the encoding called "latin-1" does not
support that character. That is a fact - Latin-1 does not support EN DASH.

> When I type 'print chr(150)' into a python command line window I get
> a LATIN SMALL LETTER U WITH CIRCUMFLEX
> (http://www.fileformat.info/info/unicode/char/00fb/index.htm),

That's because your console uses the code page 437:

    py> import unicodedata
    py> chr(150).decode("cp437")
    u'\xfb'
    py> unicodedata.name(_)
    'LATIN SMALL LETTER U WITH CIRCUMFLEX'

Code page 437, on your system, is the "OEM code page".

> but when I do so into a IDLE window I get a hyphen (chr(45)).

That's because IDLE uses the "ANSI code page" of your system, which
is windows code page 1252.

    py> chr(150).decode("windows-1252")
    u'\u2013'
    py> unicodedata.name(_)
    'EN DASH'

You actually *don't* get the character U+002D, HYPHEN-MINUS, displayed -
just a character that has, in your font, a glyph which looks similar to
the glyph for HYPHEN-MINUS. However, HYPHEN-MINUS and EN DASH are
different characters, and IDLE displays the latter, not the former.

> I tried searching "en dash" or even "dash" into the encodings folder
> of python Lib, but I couldn't find anything.

You didn't ask a specific question, so I assume you are primarily
after an explanation.

HTH,
Martin
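For readers on Python 3, where `chr(150)` returns a text character rather than a byte, the same experiment might look like the sketch below. It assumes Python 3 syntax and uses `bytes([150])` to stand in for what `chr(150)` produced in Python 2; the variable names are illustrative.

```python
import unicodedata

raw = bytes([150])  # the single byte 0x96

oem_char = raw.decode("cp437")    # OEM code page, used by the console in this thread
ansi_char = raw.decode("cp1252")  # ANSI code page, used by IDLE in this thread

for label, ch in (("cp437", oem_char), ("cp1252", ansi_char)):
    print(label, "U+%04X" % ord(ch), unicodedata.name(ch))
```

The byte never changes; only the code page used to interpret it does, which is why the console and IDLE disagreed.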
Unicode chr(150) en dash
Hello guys & girls

I'm pasting an "en dash"
(http://www.fileformat.info/info/unicode/char/2013/index.htm) character
into a tkinter widget, expecting it to be properly stored into a MySQL
database.

I'm getting this error:
*
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__
    return self.func(*args)
  File "chupadato.py", line 25, in guardar
    cursor.execute(a)
  File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 52: ordinal not in range(256)
*

Variable 'a' in 'cursor.execute(a)' contains a proper SQL statement, which
includes the 'en dash' character just pasted into the Text widget.

When I type 'print chr(150)' into a python command line window I get a
LATIN SMALL LETTER U WITH CIRCUMFLEX
(http://www.fileformat.info/info/unicode/char/00fb/index.htm), but when I
do so in an IDLE window I get a hyphen (chr(45)).

The funny thing I don't quite understand is that when I paste that 'en
dash' character into a python command window (I'm using MSWindows), the
character is conveniently converted to chr(45), which is a hyphen (I
wouldn't mind if I could do that by coding, I mean 'adapting' by visual
similarity).

I tried searching for "en dash" or even "dash" in the encodings folder of
the python Lib, but I couldn't find anything.

I'm using Windows Vista English, Python 2.4, latest MySQLdb. Default
encoding changed (while testing) to "iso-8859-1".

Thanks for any help.
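On the idea of 'adapting' by visual similarity: one possible approach, sketched here in Python 3 syntax (an assumption; the thread uses Python 2.4), is an explicit translation table applied before encoding. The table `LOOKALIKES` and the function `to_lookalikes` are hypothetical names, the choice of replacements is arbitrary, and the conversion is lossy, so it only makes sense when the storage really must stay Latin-1.

```python
# Map a few typographic characters to ASCII lookalikes (an illustrative, lossy choice).
LOOKALIKES = str.maketrans({
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_lookalikes(text):
    """Replace the mapped characters; everything else passes through unchanged."""
    return text.translate(LOOKALIKES)


sample = "1990\u20132008 \u201cquoted\u201d"
flattened = to_lookalikes(sample)
encoded = flattened.encode("latin-1")  # no longer raises for these characters
print(flattened)
```

Characters outside the table would still raise on encoding, so storing the text as UTF-8 remains the more general fix.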