
Re: Unicode chr(150) en dash

2008-04-18 Thread Martin v. Löwis

> 150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
> superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
> control codes \x80 to \x9F.

To disambiguate the two, when I want to refer to the one with the
control characters, I use the name "IANA ISO-8859-1" or "the IANA
version of Latin-1", or some such, to reflect the fact that it's
not the ISO standard, but the (unfortunately differing) IANA
registration thereof.

Regards,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-18 Thread hdante

On Apr 18, 8:36 am, John Machin <[EMAIL PROTECTED]> wrote:
> hdante wrote:
>
> >  The character code in question (which is present in the page), 150,
> > doesn't exist in ISO-8859-1.
>
> Are you sure?  Consider (re-)reading all of the Wikipedia article.
>
> 150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a
> superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT
> control codes \x80 to \x9F.
>
> > See
>
> >  http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
> > blank)
>
> You must have been looking at the table of the "lite" ISO 8859-1 (one
> hyphen). Reading further you will see \x96 described as SPA or "Start of
> Guarded Area". Then there is the ISO-8859-1 (two hyphens) table,
> including \x96.
>
> HTH,
> John

 Sorry, that's right, I should have been referring to the second
table.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-18 Thread John Machin

hdante wrote:

> 
>  The character code in question (which is present in the page), 150,
> doesn't exist in ISO-8859-1.

Are you sure?  Consider (re-)reading all of the Wikipedia article.

150 aka \x96 doesn't exist in ISO 8859-1. ISO-8859-1 (two hyphens) is a 
superset of ISO 8859-1 (one hyphen) and adds the not-very-useful-AFAICT 
control codes \x80 to \x9F.

> See
> 
>  http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
> blank)

You must have been looking at the table of the "lite" ISO 8859-1 (one 
hyphen). Reading further you will see \x96 described as SPA or "Start of 
Guarded Area". Then there is the ISO-8859-1 (two hyphens) table, 
including \x96.
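
To see the two-hyphen (IANA) behaviour concretely, here is a small Python 3 sketch (the thread itself uses Python 2.4, so the syntax is an adaptation). It relies only on Python's "latin-1" codec mapping every byte 0x00-0xFF to the code point of the same number, so byte 0x96 decodes to a C1 control character instead of failing:

```python
import unicodedata

# Byte 150 (0x96) decoded with Python's "latin-1" codec.
ch = bytes([0x96]).decode("latin-1")

# General category "Cc" means "control character".
category = unicodedata.category(ch)

# Every one of the 256 byte values round-trips through the codec.
all_bytes = bytes(range(256))
round_trip = all_bytes.decode("latin-1").encode("latin-1")

print(repr(ch), category, round_trip == all_bytes)
```

Because nothing ever fails to decode as latin-1, a successful decode says nothing about whether the data really was Latin-1.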

HTH,
John
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-18 Thread J. Clifford Dyer


On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <[EMAIL PROTECTED]> wrote:
> > 
> > >  Don't use old 8-bit encodings. Use UTF-8.
> > 
> > Yes, I'll try. But it is a problem when I only want to read, not that I'm
> > trying to write or create the content.
> > To blame I suppose is Microsoft's commercial success. They won't adhere to
> > standards if that doesn't make sense for their business.
> > 
> > I'll change the approach trying to filter the contents with htmllib and 
> > mapping on my own those troubling characters.
> > Anyway this has been a very instructive dive into unicode for me, I've got 
> > things cleared up now.
> > 
> > Thanks to everyone for the great help.
> > 
> 
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.  
> 
> Something like:
> 
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
> 
> Note that the else matches the for, not the if.
> 
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.

One warning: This works if you know all your documents are in one of
those two encodings, but it will mangle documents in other encodings, such
as UTF-8.  Fortunately UTF-8 is a pretty strict encoding, so text in
another encoding rarely passes as valid UTF-8.  You can usually test if a
document is decent UTF-8 just by wrapping the decode in a try/except block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decoding errors raise UnicodeDecodeError
    pass  # not UTF-8: do the stuff above

None of these are perfect methods, but then again, if text encoding
detection were a perfect science, python could just handle it on its
own.
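
Putting the two heuristics together, a minimal Python 3 sketch might look like the following. The names `guess_decode` and `C1_BYTES` are made up for illustration, and the order of the checks (UTF-8 first, then cp1252, then latin-1) is an assumption about what suits this kind of input, not a general-purpose detector:

```python
import re

# Bytes 0x80-0x9F: C1 control codes in IANA ISO-8859-1, mostly printable in cp1252.
C1_BYTES = re.compile(rb"[\x80-\x9f]")


def guess_decode(data):
    """Return (text, encoding_name) for a byte string, using simple heuristics."""
    # UTF-8 is strict, so a clean decode is strong evidence.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # C1-range bytes almost never occur in real Latin-1 text: assume cp1252.
    if C1_BYTES.search(data):
        try:
            return data.decode("cp1252"), "cp1252"
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves a few bytes (for example 0x81) undefined.
            pass
    # latin-1 maps every byte to a code point, so this never fails.
    return data.decode("latin-1"), "latin-1"


print(guess_decode(b"1990 \x96 2008"))
```

Pure ASCII input is reported as "utf-8" here, which is harmless because ASCII decodes identically under all three encodings.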

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-18 Thread J. Clifford Dyer

On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <[EMAIL PROTECTED]> wrote:
> 
> >  Don't use old 8-bit encodings. Use UTF-8.
> 
> Yes, I'll try. But it is a problem when I only want to read, not that I'm trying
> to write or create the content.
> To blame I suppose is Microsoft's commercial success. They won't adhere to
> standards if that doesn't make sense for their business.
> 
> I'll change the approach trying to filter the contents with htmllib and 
> mapping on my own those troubling characters.
> Anyway this has been a very instructive dive into unicode for me, I've got 
> things cleared up now.
> 
> Thanks to everyone for the great help.
> 

There are a number of code points (150 being one of them) that are used
in cp1252, which are reserved for control characters in ISO-8859-1.
Those characters will pretty much never be used in ISO-8859-1 documents.
If you're expecting documents of both types coming in, test for the
presence of those characters, and assume cp1252 for those documents.  

Something like:

for c in control_chars:
    if c in encoded_text:
        unicode_text = encoded_text.decode('cp1252')
        break
else:
    unicode_text = encoded_text.decode('latin-1')

Note that the else matches the for, not the if.

You can figure out the characters to match on by looking at the
wikipedia pages for the encodings.

Cheers,
Cliff

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-18 Thread marexposed

On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
hdante <[EMAIL PROTECTED]> wrote:

>  Don't use old 8-bit encodings. Use UTF-8.

Yes, I'll try. But it is a problem when I only want to read, not that I'm trying
to write or create the content.
To blame I suppose is Microsoft's commercial success. They won't adhere to
standards if that doesn't make sense for their business.

I'll change the approach trying to filter the contents with htmllib and mapping 
on my own those troubling characters.
Anyway this has been a very instructive dive into unicode for me, I've got 
things cleared up now.

Thanks to everyone for the great help.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-17 Thread hdante

On Apr 17, 12:10 pm, [EMAIL PROTECTED] wrote:
> Thank you Martin and John, for your excellent explanations.
>
> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I did go
> for that encoding. But if the browser can

 There's a trick here. Blame lax web standards and companies that
don't like standards.

 There's no EN DASH in ISO-8859-1. The first 256 characters in Unicode
are the same as ISO-8859-1, but EN DASH number is U+2013.

 The character code in question (which is present in the page), 150,
doesn't exist in ISO-8859-1. See

 http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (the entry for 150 is
blank)

 The character 150 exists in Windows-1252, however, which is a
non-standard clone of ISO-8859-1.

 http://en.wikipedia.org/wiki/Windows-1252

 Who is wrong ?
 - The guy who wrote the web site
 - The browser that does the trick.
 - Everybody for using a non-standard encoding
 - Everybody for using an outdated 8-bit encoding.

 Don't use old 8-bit encodings. Use UTF-8.

>
> I might need to go for python's htmllib to avoid this, not sure. But if I
> don't, if I only want to just copy and paste some web pages text contents
> into a tkinter Text widget, what should I do to successfully make every single
> character go all the way from the widget and out of tkinter into a python
> string variable? How did my browser know it should render an EN DASH instead
> of a circumflexed lowercase u?
>
> This is the webpage in case you are interested, 4th line of first paragraph, 
> there is the EN 
> DASH:http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-...
>
> Thanks a lot.
>
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-17 Thread Martin v. Löwis

> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Please do trust us that ISO-8859-1 does *NOT* support EN DASH.

There are two possible explanations for the behavior you observed:
a) even though the file was declared ISO-8859-1, the data in it
   actually didn't use that encoding. The browser somehow found out,
   and chose a different encoding from the declared one.
b) the web page contained the character reference &#8211; (or &#x2013;),
   or the entity reference &ndash;. XML allows arbitrary Unicode
   characters to be represented this way, even in a file that is
   encoded with ASCII.
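
Explanation (b) can be checked from Python. The sketch below is Python 3 and uses the standard library's `html.unescape`, which is documented as following the HTML 5 rules for both valid and invalid character references; that is an adaptation, since the thread predates it. It shows the decimal, hexadecimal and named references for EN DASH, plus the C1-range reference &#150;, which those rules remap to the Windows-1252 character at the same position:

```python
import html

EN_DASH = "\u2013"

# Decimal, hexadecimal and named references all resolve to U+2013.
references = ["&#8211;", "&#x2013;", "&ndash;"]
decoded = [html.unescape(ref) for ref in references]

# HTML5 parsing rules remap the C1-range reference &#150; to the
# Windows-1252 character at that position, which is again U+2013.
legacy = html.unescape("&#150;")

print(decoded, repr(legacy))
```

So a page can be pure ISO-8859-1 (or even pure ASCII) at the byte level and still display an en dash.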

> I might need to go for python's htmllib to avoid this, not sure. But
> if I don't, if I only want to just copy and paste some web pages text
> contents into a tkinter Text widget, what should I do to successfully
> make every single character go all the way from the widget and out of
> tkinter into a python string variable? How did my browser know it
> should render an EN DASH instead of a circumflexed lowercase u?

Read the source of the web page to be certain.

> This is the webpage in case you are interested, 4th line of first
> paragraph, there is the EN DASH:
> http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Ok, this says – in several places, as well as “ and ”

HTH,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-17 Thread Richard Brodie


<[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]

> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I
> did go for that encoding. But if the browser can properly decode that
> character using that encoding, how come other applications can't?

Browsers tend to guess what the author intended a lot.  In particular, they
fudge the difference between ISO8859-1 and Windows-1252.
http://en.wikipedia.org/wiki/Windows-1252


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-17 Thread s0suk3

On Apr 17, 10:10 am, [EMAIL PROTECTED] wrote:
> Thank you Martin and John, for your excellent explanations.
>
> I think I understand the unicode basic principles, what confuses me is the
> usage different applications make out of it.
>
> For example, I got that EN DASH out of a web page which states
> <?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I did go
> for that encoding. But if the browser can properly decode that character
> using that encoding, how come other applications can't?
>
> I might need to go for python's htmllib to avoid this, not sure. But if I
> don't, if I only want to just copy and paste some web pages text contents
> into a tkinter Text widget, what should I do to successfully make every single
> character go all the way from the widget and out of tkinter into a python
> string variable? How did my browser know it should render an EN DASH instead
> of a circumflexed lowercase u?
>
> This is the webpage in case you are interested, 4th line of first paragraph, 
> there is the EN 
> DASH:http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-...
>
> Thanks a lot.
>

Simply write in English. Like this, see? No encoding mess.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-17 Thread marexposed

Thank you Martin and John, for your excellent explanations.

I think I understand the unicode basic principles, what confuses me is the
usage different applications make out of it.

For example, I got that EN DASH out of a web page which states
<?xml version="1.0" encoding="ISO-8859-1"?> at the beginning. That's why I did
go for that encoding. But if the browser can properly decode that character
using that encoding, how come other applications can't?

I might need to go for python's htmllib to avoid this, not sure. But if I
don't, if I only want to just copy and paste some web pages text contents into
a tkinter Text widget, what should I do to successfully make every single
character go all the way from the widget and out of tkinter into a python
string variable? How did my browser know it should render an EN DASH instead of
a circumflexed lowercase u?

This is the webpage in case you are interested, 4th line of first paragraph, 
there is the EN DASH: 
http://www.pagina12.com.ar/diario/elmundo/subnotas/102453-32303-2008-04-15.html

Thanks a lot.


On Wed, 16 Apr 2008 10:27:26 -0700
John Nagle <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > Hello guys & girls
> > 
> > I'm pasting an "en dash"
> > (http://www.fileformat.info/info/unicode/char/2013/index.htm) character into
> > a tkinter widget, expecting it to be properly stored into a MySQL database.
> > 
> > I'm getting this error: 
> > *
> >  Exception in Tkinter callback Traceback (most recent call last): File
> > "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__ return
> > self.func(*args) File "chupadato.py", line 25, in guardar cursor.execute(a) 
> > File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in 
> > execute
> >  query = query.encode(charset) UnicodeEncodeError: 'latin-1' codec can't
> > encode character u'\u2013' in position 52: ordinal not in range(256) 
> > *
> 
>  Python and MySQL will do end to end Unicode quite well.  But that's
> not what you're doing.  How did "latin-1" get involved?
> 
>  If you want to use MySQL in Unicode, there are several things to be done.
> First, the connection has to be opened in Unicode:
> 
>   db = MySQLdb.connect(host="localhost",
>   use_unicode = True, charset = "utf8",
>   user=username, passwd=password, db=database)
> 
> Yes, you have to specify both "use_unicode=True", which tells the client
> to talk Unicode, and set "charset" to "utf8", which tells the server
> to talk Unicode encoded as UTF-8.
> 
> Then the tables need to be in Unicode.  In SQL,
> 
>  ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;
> 
> before creating the tables.  You can also change the types of
> existing tables and even individual fields to utf8, if necessary.
> (This takes time for big tables; the table is copied.  But it works.)
> 
>  It's possible to get MySQL to store character sets other than
> ASCII or Unicode; you can store data in "latin1" if you want. This
> might make sense if, for example, all your data is in French or German,
> which maps well to "latin1".  Unless that's your situation, go with
> either all-ASCII or all-Unicode.  It's less confusing.
> 
>   John Nagle
> -- 
> http://mail.python.org/mailman/listinfo/python-list
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-16 Thread John Nagle

[EMAIL PROTECTED] wrote:
> Hello guys & girls
> 
> I'm pasting an "en dash"
> (http://www.fileformat.info/info/unicode/char/2013/index.htm) character into
> a tkinter widget, expecting it to be properly stored into a MySQL database.
> 
> I'm getting this error: 
> *
>  Exception in Tkinter callback Traceback (most recent call last): File
> "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__ return
> self.func(*args) File "chupadato.py", line 25, in guardar cursor.execute(a) 
> File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
>  query = query.encode(charset) UnicodeEncodeError: 'latin-1' codec can't
> encode character u'\u2013' in position 52: ordinal not in range(256) 
> *

 Python and MySQL will do end to end Unicode quite well.  But that's
not what you're doing.  How did "latin-1" get involved?

 If you want to use MySQL in Unicode, there are several things to be done.
First, the connection has to be opened in Unicode:

    db = MySQLdb.connect(host="localhost",
                         use_unicode=True, charset="utf8",
                         user=username, passwd=password, db=database)

Yes, you have to specify both "use_unicode=True", which tells the client
to talk Unicode, and set "charset" to "utf8", which tells the server
to talk Unicode encoded as UTF-8.

Then the tables need to be in Unicode.  In SQL,

 ALTER DATABASE dbname DEFAULT CHARACTER SET utf8;

before creating the tables.  You can also change the types of
existing tables and even individual fields to utf8, if necessary.
(This takes time for big tables; the table is copied.  But it works.)

 It's possible to get MySQL to store character sets other than
ASCII or Unicode; you can store data in "latin1" if you want. This
might make sense if, for example, all your data is in French or German,
which maps well to "latin1".  Unless that's your situation, go with
either all-ASCII or all-Unicode.  It's less confusing.

John Nagle
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode chr(150) en dash

2008-04-15 Thread Martin v. Löwis

> "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in
> execute query = query.encode(charset) UnicodeEncodeError: 'latin-1'
> codec can't encode character u'\u2013' in position 52: ordinal not in
> range(256) 

Here it complains that it deals with the character U+2013, which
is "EN DASH"; it complains that the encoding called "latin-1" does
not support that character.

That is a fact - Latin-1 does not support EN DASH.
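
The same failure can be reproduced without MySQL or Tkinter. The following Python 3 sketch is an adaptation (the traceback above comes from Python 2.4), and the helper names `can_encode` and `to_latin1` and the `LOOKALIKES` table are hypothetical. It checks whether a string fits an encoding and, if a Latin-1 column really is required, substitutes visually similar ASCII characters before encoding:

```python
# Hypothetical mapping of a few Windows-1252 punctuation marks to ASCII lookalikes.
LOOKALIKES = {
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
}


def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def to_latin1(text):
    """Swap known lookalikes, then encode; any other unsupported character becomes '?'."""
    return text.translate(LOOKALIKES).encode("latin-1", errors="replace")


print(can_encode("1990\u20132008", "latin-1"))
print(to_latin1("1990\u20132008"))
```

The substitution is lossy by design; storing the text as UTF-8 avoids the problem altogether.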

> When I type 'print chr(150)' into a python command line window I get
> a LATIN SMALL LETTER U WITH CIRCUMFLEX
> (http://www.fileformat.info/info/unicode/char/00fb/index.htm),

That's because your console uses the code page 437:

py> import unicodedata
py> chr(150).decode("cp437")
u'\xfb'
py> unicodedata.name(_)
'LATIN SMALL LETTER U WITH CIRCUMFLEX'

Code page 437, on your system, is the "OEM code page".

> but when I do so into an IDLE window I get a hyphen (chr(45)).

That's because IDLE uses the "ANSI code page" of your system,
which is windows code page 1252.

py> chr(150).decode("windows-1252")
u'\u2013'
py> unicodedata.name(_)
'EN DASH'

You actually *don't* get the character U+002D, HYPHEN-MINUS,
displayed - just a character that has, in your font, a glyph
which looks similar to the glyph for HYPHEN-MINUS.
However, HYPHEN-MINUS and EN DASH are different characters, and
IDLE displays the latter, not the former.
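
For readers on Python 3, where `chr(150)` is already a Unicode character rather than a byte, the two sessions above can be sketched as follows. The variable names are arbitrary; the point is that the same byte, 0x96, names a different character under each code page:

```python
import unicodedata

raw = bytes([150])  # the single byte 0x96

# Decode the same byte under the OEM and ANSI code pages discussed above.
names = {}
for codec in ("cp437", "cp1252"):
    ch = raw.decode(codec)
    names[codec] = unicodedata.name(ch)
    print(codec, "U+%04X" % ord(ch), names[codec])

# The ASCII hyphen is a different character from either of them.
print(unicodedata.name("-"))
```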

> I tried searching "en dash" or even "dash" into the encodings folder
> of python Lib, but I couldn't find anything.

You didn't ask a specific question, so I assume you are primarily
after an explanation.

HTH,
Martin
-- 
http://mail.python.org/mailman/listinfo/python-list

Unicode chr(150) en dash

2008-04-15 Thread marexposed

Hello guys & girls

I'm pasting an "en dash" 
(http://www.fileformat.info/info/unicode/char/2013/index.htm) character into a 
tkinter widget, expecting it to be properly stored into a MySQL database.

I'm getting this error:
*
Exception in Tkinter callback
Traceback (most recent call last):
  File "C:\Python24\lib\lib-tk\Tkinter.py", line 1345, in __call__
    return self.func(*args)
  File "chupadato.py", line 25, in guardar
    cursor.execute(a)
  File "C:\Python24\Lib\site-packages\MySQLdb\cursors.py", line 149, in execute
    query = query.encode(charset)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 52: ordinal not in range(256)
*

Variable 'a' in 'cursor.execute(a)' contains a proper SQL statement, which 
includes the 'en dash' character just pasted into the Text widget.

When I type 'print chr(150)' into a python command line window I get a LATIN
SMALL LETTER U WITH CIRCUMFLEX
(http://www.fileformat.info/info/unicode/char/00fb/index.htm), but when I do so
into an IDLE window I get a hyphen (chr(45)).

The funny thing I don't quite understand is, when I do paste that 'en dash'
character into a python command window (I'm using MSWindows), the character is
conveniently converted to chr(45) which is a hyphen (I wouldn't mind if I could
do that by coding, I mean 'adapting' by visual similarity).

I tried searching "en dash" or even "dash" into the encodings folder of python 
Lib, but I couldn't find anything.

I'm using Windows Vista English, Python 2.4, latest MySQLdb. Default encoding
changed (while testing) into "iso-8859-1".

Thanks for any help.
-- 
http://mail.python.org/mailman/listinfo/python-list
