Re: [Tutor] encoding question

2014-01-05 Thread spir

On 01/05/2014 12:52 AM, Steven D'Aprano wrote:

If you don't understand an exception, you
have no business covering it up and hiding that it took place. Never use
a bare try...except, always catch the *smallest* number of specific
exception types that make sense. Better is to avoid catching exceptions
at all: an exception (usually) means something has gone wrong. You
should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare. -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash,
it just covers up the fact that an exception occurred.


An exception, or any other kind of anomaly detected by a func one calls, is in 
most cases a *symptom* of an error, somewhere else in one's code (possibly far 
in source, possibly long earlier, possibly apparently unrelated). Catching an 
exception (except in rare cases), is just suppressing a _signal_ about a 
probable error. Catching an exception does not make the code correct, it just 
pretends to (except in rare cases). It's like hiding the dirt under a carpet, or 
beating up the poor guy who ran for 3 kilometers to tell you a fire is 
threatening your home.


Again: the anomaly (eg wrong input) detected by a func is not the error; it is a 
consequence of the true original error, what one should aim at correcting. (But 
our culture apparently loves repressing symptoms rather than curing actual 
problems: we programmers just often thoughtlessly apply the scheme ;-)


We should instead gratefully thank func authors for having correctly done their 
jobs of controlling input. They offer us the information needed to find bugs 
which otherwise may happily go on their lives undetected; and thus the 
opportunity to write more correct software. (This is why func authors should 
control input, refuse any anomalous or dubious values, and never ever try to 
guess what the app expects in such cases; instead just say cannot do my job 
safely, or at all.)


If one is passing an empty set to an 'average' func, don't blame the func or 
shut up the signal/exception, instead be grateful to the func's author, and find 
why and how it happens the set is empty. If one is trying to write into a 
file, don't blame the file for not existing, the user for being stupid, or shut 
up the signal/exception, instead be grateful to the func's author, and find why 
and how it happens the file does not exist, now (about the user: is your doc 
clear enough?).


The sub-category of cases where exception handling makes sense at all is the 
following:
* a called function may fail (eg average, find a given item in a list, write 
into a file)

* and, the failure case makes sense for the app, it _does_ belong to the app 
logic
* and, the case should nevertheless be handled like others up to this point in 
code (meaning, there should not be a separate branch for it, we should really 
land there in code even for this failure case)
* and, one cannot know whether it is a failure case without trying, or it would 
be as costly as just trying (wrong for average, right for 2 other examples)
* and, one can repair the failure right here, in any case, and go on correctly 
according to the app logic (depends on apps) (there is also the category of 
alternate running modes)


In such a situation, the right thing to do is to catch the exception signal (or 
use whatever error management exists, eg a check for a None return value) and 
proceed correctly (and think of testing this case ;-).
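
A minimal sketch of such a case, in Python 3. The file name settings.ini and the helper load_settings are made up for illustration. For this hypothetical app a missing settings file is a normal situation (first run), one cannot reliably know it without trying, and it can be repaired on the spot by falling back to defaults. Only that one exception type is caught; anything else still propagates.

```python
import os
import tempfile

DEFAULTS = {"language": "en"}  # hypothetical default settings


def load_settings(path):
    """Return settings read from `path`, or the defaults if the file is absent."""
    try:
        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        # Expected case for this app: nothing has been saved yet.
        return dict(DEFAULTS)
    settings = dict(DEFAULTS)
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# A path that is guaranteed not to exist yet.
missing = os.path.join(tempfile.mkdtemp(), "settings.ini")
print(load_settings(missing))
```

Note that a permission error or a badly encoded file is not part of the app logic here, so it is deliberately left to crash.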


But this is not that common. In particular, if the failure case does not belong 
to the app logic (the item should be there, the file should exist) then do *not* 
catch a potential signal: if it happens, it would tell you about a bug 
*elsewhere* in code; and _this_ is what is to be corrected.


There is a mythology in programming, that software should not crash; wrongly 
understood (or rightly, authors of such texts usually are pretty unclear and 
ambiguous), this leads to catching exceptions that are just signals of symptoms 
of errors... Instead, software should crash whenever it is incorrect; often 
(when the error does not cause obvious misbehaviour) it is the only way for the 
programmer to know about errors. Crashes are the programmer's best friend (I 
mean, those programmers whose aim is to write quality software).


Denis
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] encoding question

2014-01-05 Thread spir

On 01/04/2014 08:26 PM, Alex Kleider wrote:

Any suggestions as to a better way to handle the problem of encoding in the
following context would be appreciated.  The problem arose because 'Bogota' is
spelt with an acute accent on the 'a'.

$ cat IP_info.py3
#!/usr/bin/env python3
# -*- coding : utf -8 -*-
# file: 'IP_info.py3'  a module.

import urllib.request

url_format_str = \
    'http://api.hostip.info/get_html.php?ip=%s&position=true'

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib.request.urlopen(url_format_str %\
                                   (ip_address, )).read()
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        if item.startswith(b"Country:"):
            try:
                country = item[9:].decode('utf-8')
            except:
                print("Exception raised.")
                country = item[9:]
        elif item.startswith(b"City:"):
            try:
                city = item[6:].decode('utf-8')
            except:
                print("Exception raised.")
                city = item[6:]
        elif item.startswith(b"Latitude:"):
            try:
                lat = item[10:].decode('utf-8')
            except:
                print("Exception raised.")
                lat = item[10]
        elif item.startswith(b"Longitude:"):
            try:
                lon = item[11:].decode('utf-8')
            except:
                print("Exception raised.")
                lon = item[11]
        elif item.startswith(b"IP:"):
            try:
                ip = item[4:].decode('utf-8')
            except:
                print("Exception raised.")
                ip = item[4:]
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip}

if __name__ == "__main__":
    addr =  "201.234.178.62"
    print ("""    IP address is %(IP)s:
        Country: %(Country)s;  City: %(City)s.
        Lat/Long: %(Lat)s/%(Long)s""" % ip_info(addr))


The output I get on an Ubuntu 12.4LTS system is as follows:
alex@x301:~/Python/Parse$ ./IP_info.py3
Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833


I would have thought that utf-8 could handle the 'a-acute'.

Thanks,
alex


'á' does not encode to 0xe1 in UTF-8; what you are reading is probably (legacy) 
data in latin-1 (or another latin-* encoding).


Denis


Re: [Tutor] encoding question

2014-01-05 Thread spir

On 01/05/2014 03:31 AM, Alex Kleider wrote:

I've been maintaining both a Python3 and a Python2.7 version.  The latter has
actually opened my eyes to more complexities. Specifically the need to use
unicode strings rather than Python2.7's default ascii.


So-called Unicode strings are not the solution to all problems. Take your 'á': 
it can be represented either by 1 precomposed code (Unicode code point) 0xe1, 
or by 2 codes (one for the base 'a', one for the combining '´'). Imagine you 
search for Bogotá: how do you know which representation is used in the text 
you search? How do you know at all that there are multiple representations, 
and what they are? The routine will work iff, by chance, your *programming 
editor* (!) used the same representation as the software used to create the 
searched text...


Usually it is the case, because most text-creation software uses precomposed 
codes, when they exist, for composite characters. (But this fact just makes the 
issue rarer, harder to be aware of, and thus difficult to cope with correctly in 
code. As far as I know, nearly no software does it.)
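
To make the two representations concrete, here is a small Python 3 sketch using the standard unicodedata module. The helper name same_text is made up for illustration; normalising both sides to NFC before comparing is one common way to cope, not the only one.

```python
import unicodedata

precomposed = 'Bogot\u00e1'   # 'á' as a single code point, U+00E1
decomposed = 'Bogota\u0301'   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

# Both display as the same text, but they are different strings.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))


def same_text(a, b):
    """Compare two strings after bringing both to the composed (NFC) form."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


print(same_text(precomposed, decomposed))
```

A plain search for one form in a text that uses the other finds nothing, which is exactly the trap described above.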


Denis


Re: [Tutor] encoding question

2014-01-05 Thread spir

On 01/05/2014 08:57 AM, Alex Kleider wrote:

On 2014-01-04 21:20, Danny Yoo wrote:

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.


Well, I've tried the xml approach which seems promising but still I get an
encoding related error.


Note that the (computing) data description format (JSON, XML...) and the textual 
format, or encoding (Unicode utf8/16/32, legacy iso-8859-* also called 
latin-*, ...) are more or less unrelated and independent. Changing the data 
description format cannot solve a text encoding issue (but it may hide it, if by 
chance the new data description format happens to use the text encoding you 
happen to use when reading, implicitly or explicitly).
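
A small Python 3 illustration of this point, with a made-up JSON payload (not what hostip.info actually returns): the same Latin-1 bytes are just as undecodable as UTF-8 when wrapped in JSON, and the fix is the same, namely decoding with the right encoding first.

```python
import json

# Hypothetical payload: valid JSON syntax, but the bytes are Latin-1, not UTF-8.
payload = '{"city": "Bogot\xe1"}'.encode('latin-1')

try:
    json.loads(payload)          # json assumes a Unicode encoding for bytes
except UnicodeDecodeError as err:
    print('switching format did not help:', err)

# Decoding with the encoding actually used solves it, whatever the format.
record = json.loads(payload.decode('latin-1'))
print(record['city'])
```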


Denis


Re: [Tutor] encoding question

2014-01-05 Thread Mark Lawrence

On 05/01/2014 02:31, Alex Kleider wrote:


I've been maintaining both a Python3 and a Python2.7 version.  The
latter has actually opened my eyes to more complexities. Specifically
the need to use unicode strings rather than Python2.7's default ascii.



This might help http://python-future.org/

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Tutor] encoding question

2014-01-05 Thread Steven D'Aprano
On Sat, Jan 04, 2014 at 11:57:20PM -0800, Alex Kleider wrote:

 Well, I've tried the xml approach which seems promising but still I get 
 an encoding related error.
 Is there a bug in the xml.etree module (not very likely, me thinks) or 
 am I doing something wrong?

I'm no expert on XML, but it looks to me like it is a bug in 
ElementTree. It doesn't appear to handle unicode strings correctly 
(although perhaps it doesn't promise to).

A simple demonstration using Python 2.7:

py> import xml.etree.ElementTree as ET
py> ET.fromstring(u'<xml>a</xml>')
<Element 'xml' at 0xb7ca982c>

But:

py> ET.fromstring(u'<xml>á</xml>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
    parser.feed(text)
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1622, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in 
position 5: ordinal not in range(128)

An easy work-around:

py> ET.fromstring(u'<xml>á</xml>'.encode('utf-8'))
<Element 'xml' at 0xb7ca9a8c>

although, as I said, I'm no expert on XML and this may lead to errors 
later on.


 There's no denying that the whole encoding issue is still not completely 
 clear to me in spite of having devoted a lot of time to trying to grasp 
 all that's involved.

Have you read Joel On Software's explanation?

http://www.joelonsoftware.com/articles/Unicode.html

It's well worth reading. Start with that, and then ask if you have any 
further questions.


 Here's what I've got:
 
 alex@x301:~/Python/Parse$ cat ip_xml.py
 #!/usr/bin/env python
 # -*- coding : utf -8 -*-
 # file: 'ip_xml.py'
[...]
 tree = ET.fromstring(xml)
 root = tree.getroot()   # Here's where it blows up!!!

I reckon that what you need is to change the first line to:

tree = ET.fromstring(xml.encode('latin-1'))

or whatever the encoding is meant to be.


-- 
Steven


Re: [Tutor] encoding question

2014-01-05 Thread eryksun
On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider aklei...@sonic.net wrote:
 def ip_info(ip_address):

     response =  urllib2.urlopen(url_format_str %\
                                    (ip_address, ))
     encoding = response.headers.getparam('charset')
     print "'encoding' is '%s'." % (encoding, )
     info = unicode(response.read().decode(encoding))

decode() returns a unicode object.

     n = info.find('\n')
     print "location of first newline is %s." % (n, )
     xml = info[n+1:]
     print "'xml' is '%s'." % (xml, )

     tree = ET.fromstring(xml)
     root = tree.getroot()   # Here's where it blows up!!!
     print "'root' is '%s', with the following children:" % (root, )
     for child in root:
         print child.tag, child.attrib
     print "END of CHILDREN"
     return info

Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:

<?xml version="1.0" encoding="ISO-8859-1" ?>

Leave it to ElementTree. Here's something to get you started:

import urllib2
import xml.etree.ElementTree as ET
import collections

url_format_str = 'http://api.hostip.info/?ip=%s&position=true'
GML = 'http://www.opengis.net/gml'
IPInfo = collections.namedtuple('IPInfo', '''
    ip
    city
    country
    latitude
    longitude
''')

def ip_info(ip_address):
    response = urllib2.urlopen(url_format_str %
                               ip_address)
    tree = ET.fromstring(response.read())
    hostip = tree.find('{%s}featureMember/Hostip' % GML)
    ip = hostip.find('ip').text
    city = hostip.find('{%s}name' % GML).text
    country = hostip.find('countryName').text
    coord = hostip.find('.//{%s}coordinates' % GML).text
    lon, lat = coord.split(',')
    return IPInfo(ip, city, country, lat, lon)


>>> info = ip_info('201.234.178.62')
>>> info.ip
'201.234.178.62'
>>> info.city, info.country
(u'Bogot\xe1', 'COLOMBIA')
>>> info.latitude, info.longitude
('10.4', '-75.2833')

This assumes everything works perfectly. You have to decide how to fail
gracefully for the service being unavailable or malformed XML
(incomplete or corrupted response, etc).
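
One possible shape for that, sketched in Python 3 rather than 2.7. The helper name parse_city and the trimmed-down sample document are assumptions made for illustration (the real response has more elements). The idea is to turn the two failure modes, malformed XML and a missing element, into one well-described error instead of a stray AttributeError.

```python
import xml.etree.ElementTree as ET

GML = 'http://www.opengis.net/gml'


def parse_city(xml_bytes):
    """Return the city name from a hostip-style XML document (bytes)."""
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # Incomplete or corrupted response.
        raise ValueError('malformed XML: %s' % err)
    node = root.find('.//{%s}name' % GML)
    if node is None or node.text is None:
        raise ValueError('no city name in response')
    return node.text


# Hypothetical, heavily trimmed sample in the encoding the service declares.
sample = (b'<?xml version="1.0" encoding="ISO-8859-1" ?>'
          b'<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">'
          b'<gml:featureMember><Hostip><gml:name>Bogot\xe1</gml:name>'
          b'</Hostip></gml:featureMember></HostipLookupResultSet>')
print(parse_city(sample))
```

What the caller does with the ValueError (retry, report, give up) depends on the application.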


Re: [Tutor] encoding question

2014-01-05 Thread Alex Kleider

On 2014-01-05 08:02, eryksun wrote:
[...]


Thanks again for the input.
You're using some ET syntax there that would probably make my code much 
more readable but will require a bit more study on my part.


I was up all night trying to get this sorted out and was finally 
successful.

(Re-) Reading 'joelonsoftware' and some of the Python docs helped.
Here's what I came up with (still needs modification to return a 
dictionary, but that'll be trivial.)


alex@x301:~/Python/Parse$ cat ip_xml.py
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# -*- coding : utf-8 -*-
# file: 'ip_xml.py'

import urllib2
import xml.etree.ElementTree as ET


url_format_str = \
    u'http://api.hostip.info/?ip=%s&position=true'

def ip_info(ip_address):
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding)
    # info comes in as type 'unicode'.
    n = info.find('\n')
    xml = info[n+1:]  # Get rid of a header line.
    # root = ET.fromstring(xml) # This causes error:
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    # in position 456: ordinal not in range(128)
    root = ET.fromstring(xml.encode("utf-8"))
    # This is the part I still don't fully understand but would
    # probably have to look at the library source to do so.
    info = []
    for i in range(4):
        info.append(root[3][0][i].text)
    info.append(root[3][0][4][0][0][0].text)

    return info

if __name__ == "__main__":
    info = ip_info("201.234.178.62")
    print info
    print info[1]

alex@x301:~/Python/Parse$ ./ip_xml.py
['201.234.178.62', u'Bogot\xe1', 'COLOMBIA', 'CO', '-75.2833,10.4']
Bogotá

Thanks to all who helped.
ak


Re: [Tutor] encoding question

2014-01-05 Thread Steven D'Aprano
On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:

 Danny walked you through the XML. Note that he didn't decode the
 response. It includes an encoding on the first line:
 
 <?xml version="1.0" encoding="ISO-8859-1" ?>

That surprises me. I thought XML was only valid in UTF-8? Or maybe that 
was wishful thinking.

 tree = ET.fromstring(response.read())

In other words, leave it to ElementTree to manage the decoding and 
encoding itself. Nice -- I like that solution.
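
A quick Python 3 check of that behaviour, using a made-up one-element document rather than the real response: when ElementTree is handed the raw bytes, the parser reads the encoding from the XML declaration itself.

```python
import xml.etree.ElementTree as ET

# Raw bytes as they would arrive from the network: Latin-1, and declared as such.
raw = b'<?xml version="1.0" encoding="ISO-8859-1" ?><city>Bogot\xe1</city>'

root = ET.fromstring(raw)   # no manual decode() beforehand
print(root.text)
```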



-- 
Steven


Re: [Tutor] encoding question

2014-01-05 Thread Alex Kleider

On 2014-01-05 14:26, Steven D'Aprano wrote:

On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:


Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:

<?xml version="1.0" encoding="ISO-8859-1" ?>


That surprises me. I thought XML was only valid in UTF-8? Or maybe that
was wishful thinking.


tree = ET.fromstring(response.read())


I believe you were correct the first time.
My experience with all of this has been that in spite of the xml having 
been advertised as having been encoded in ISO-8859-1 (which I believe is 
synonymous with Latin-1), my script (specifically Python's xml parser: 
xml.etree.ElementTree) didn't work until the xml was decoded from 
Latin-1 (into Unicode) and then encoded into UTF-8. Here's the snippet 
with some comments mentioning the painful lessons learned:


    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding)
    # info comes in as type 'unicode'.
    n = info.find('\n')
    xml = info[n+1:]  # Get rid of a header line.
    # root = ET.fromstring(xml) # This causes error:
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    # in position 456: ordinal not in range(128)
    root = ET.fromstring(xml.encode("utf-8"))





In other words, leave it to ElementTree to manage the decoding and
encoding itself. Nice -- I like that solution.



Re: [Tutor] encoding question

2014-01-05 Thread eryksun
On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano st...@pearwood.info wrote:
 On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:

 <?xml version="1.0" encoding="ISO-8859-1" ?>

 That surprises me. I thought XML was only valid in UTF-8? Or maybe that
 was wishful thinking.

JSON text SHALL be encoded in Unicode:

https://tools.ietf.org/html/rfc4627#section-3

For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the
MIME charset takes precedence. Section 8 has examples:

https://tools.ietf.org/html/rfc3023#section-8

So I was technically wrong to rely on the XML encoding (they happen to
be the same in this case). Instead you can create a parser with the
encoding from the header:

encoding = response.headers.getparam('charset')
parser = ET.XMLParser(encoding=encoding)
tree = ET.parse(response, parser)

The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1
and Unicode transport encodings. So it's probably better to transcode
to UTF-8 as Alex is doing, but then use a custom parser to override
the XML encoding:

encoding = response.headers.getparam('charset')
info = response.read().decode(encoding).encode('utf-8')

parser = ET.XMLParser(encoding='utf-8')
tree = ET.fromstring(info, parser)
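
The same transcoding idea as a self-contained Python 3 sketch. The sample document is made up, and header_charset stands in for the value that would come from the HTTP headers. The bytes are re-encoded to UTF-8 while the XML declaration still says ISO-8859-1, so the parser is told explicitly which encoding to use.

```python
import xml.etree.ElementTree as ET

raw = b'<?xml version="1.0" encoding="ISO-8859-1" ?><city>Bogot\xe1</city>'
header_charset = 'iso-8859-1'   # assumption: taken from the response headers

# Transcode: bytes -> text (using the header charset) -> UTF-8 bytes.
utf8_bytes = raw.decode(header_charset).encode('utf-8')

# The declaration inside the document is now stale, so override it.
parser = ET.XMLParser(encoding='utf-8')
root = ET.fromstring(utf8_bytes, parser=parser)
print(root.text)
```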


Re: [Tutor] encoding question

2014-01-04 Thread eryksun
On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider aklei...@sonic.net wrote:
 The output I get on an Ubuntu 12.4LTS system is as follows:
 alex@x301:~/Python/Parse$ ./IP_info.py3
 Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833


 I would have thought that utf-8 could handle the 'a-acute'.

b'\xe1' is Latin-1. Look in the response headers:

url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'

>>> response = urllib.request.urlopen(url)
>>> response.headers.get_charsets()
['iso-8859-1']

>>> encoding = response.headers.get_charsets()[0]
>>> sp = response.read().decode(encoding).splitlines()
>>> sp[1]
'City: Bogotá'


Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider

On 2014-01-04 12:01, eryksun wrote:
[...]


Thank you very much.  Now things are more clear.
cheers,
alex


Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano
On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
 Any suggestions as to a better way to handle the problem of encoding in 
 the following context would be appreciated.

Python gives you lots of useful information when errors occur, but 
unfortunately your code throws that information away and replaces it 
with a totally useless message:

     try:
         country = item[9:].decode('utf-8')
     except:
         print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error 
message did it have? Why did it happen? Nobody knows, because you throw 
it away.

Never, never, never do this. If you don't understand an exception, you 
have no business covering it up and hiding that it took place. Never use 
a bare try...except, always catch the *smallest* number of specific 
exception types that make sense. Better is to avoid catching exceptions 
at all: an exception (usually) means something has gone wrong. You 
should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare. -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash, 
it just covers up the fact that an exception occurred.


 The output I get on an Ubuntu 12.4LTS system is as follows:
 alex@x301:~/Python/Parse$ ./IP_info.py3
 Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833
 
 
 I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

py> 'Bogotá'.encode('utf-8')
b'Bogot\xc3\xa1'

py> b'Bogot\xc3\xa1'.decode('utf-8')
'Bogotá'


But you don't have UTF-8. You have something else, and trying to decode 
it using UTF-8 fails.

py> b'Bogot\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: 
unexpected end of data
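
Put side by side in a small Python 3 script (the variable names are only for illustration): the same character has different bytes in the two encodings, and catching the specific exception type keeps the useful information instead of throwing it away.

```python
raw = b'Bogot\xe1'   # the bytes the service actually sent

# The single byte 0xe1 is how Latin-1 spells 'á'; UTF-8 needs two bytes.
print('Bogot\xe1'.encode('latin-1'))
print('Bogot\xe1'.encode('utf-8'))

try:
    text = raw.decode('utf-8')
except UnicodeDecodeError as err:
    # Specific exception type, and the message is kept, not discarded.
    print('not UTF-8:', err.reason)
    text = raw.decode('latin-1')

print(text)
```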


More to follow...




-- 
Steven


Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider

On 2014-01-04 15:52, Steven D'Aprano wrote:


[...]

Of course it can:

py> 'Bogotá'.encode('utf-8')


I'm interested in knowing how you were able to enter the above line 
(assuming you have a key board similar to mine.)




[...]


I very much agree with your remarks.
In a pathetic attempt at self defence I just want to mention that what I 
presented wasn't what I thought was a solution.
Rather it was an attempt to figure out what the problem was as a 
preliminary step to fixing it.

With help from you and others, I was successful in doing this.
And for that help, I thank all list participants very much.



Re: [Tutor] encoding question

2014-01-04 Thread eryksun
On Sat, Jan 4, 2014 at 7:15 PM, Alex Kleider aklei...@sonic.net wrote:

 py> 'Bogotá'.encode('utf-8')

 I'm interested in knowing how you were able to enter the above line
 (assuming you have a key board similar to mine.)

I use an international keyboard layout:

https://en.wikipedia.org/wiki/QWERTY#US-International

One could also copy and paste from a printed literal:

>>> 'Bogot\xe1'
'Bogotá'

Or more verbosely:

>>> 'Bogot\N{latin small letter a with acute}'
'Bogotá'


Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano
Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
 Any suggestions as to a better way to handle the problem of encoding in 
 the following context would be appreciated.  The problem arose because 
 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from 
the webpage's headers. That will help 9 times out of 10. But 
unfortunately sometimes webpages will lack an encoding header, or they 
will lie, or the text will be invalid for that encoding. What to do 
then?

Let's start by factoring out the repeated code in your giant for-loop 
into something more manageable and maintainable:

     sp = response.splitlines()
     country = city = lat = lon = ip = ''
     for item in sp:
         if item.startswith(b"Country:"):
             try:
                 country = item[9:].decode('utf-8')
             except:
                 print("Exception raised.")
                 country = item[9:]
         elif item.startswith(b"City:"):
             try:
                 city = item[6:].decode('utf-8')
             except:
                 print("Exception raised.")
                 city = item[6:]

and so on, becomes:

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
country = city = lat = lon = ip = ''
for item in sp:
    if not item.strip():
        continue  # the response contains a blank line
    key, value = item.split(b':', 1)  # item is bytes, so split on bytes
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key == 'Country':
        country = value
    elif key == 'City':
        city = value
    elif key == 'Latitude':
        lat = value
    elif key == 'Longitude':
        lon = value
    elif key == 'IP':
        ip = value
    else:
        raise ValueError('unknown key "%s" found' % key)
return {"Country" : country,
        "City" : city,
        "Lat" : lat,
        "Long" : lon,
        "IP" : ip
        }


But we can do better than that!

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
record = {"Country": None, "City": None, "Latitude": None, 
          "Longitude": None, "IP": None}
for item in sp:
    if not item.strip():
        continue  # the response contains a blank line
    key, value = item.split(b':', 1)  # item is bytes, so split on bytes
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key in record:
        record[key] = value
    else:
        raise ValueError('unknown key "%s" found' % key)
if None in list(record.values()):
    for key, value in record.items():
        if value is None: break
    raise ValueError('missing key in record: %s' % key)
return record


This simplifies the code a lot, and adds some error-handling. It may be 
appropriate for your application to handle missing keys by using some 
default value, such as an empty string, or some other value that cannot 
be mistaken for an actual value, say *missing*. But since I don't know 
your application's needs, I'm going to leave that up to you. Better to 
start strict and loosen up later, than start too loose and never realise 
that errors are occurring.
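
For anyone who wants to try the strict version without hitting the network, here is a self-contained Python 3 sketch. The function name parse_record and the sample bytes (modelled on the output shown earlier in the thread) are assumptions for illustration. It decodes once up front, so the splitting works on text rather than bytes.

```python
FIELDS = ('Country', 'City', 'Latitude', 'Longitude', 'IP')


def parse_record(raw, encoding):
    """Parse 'Key: value' lines from raw bytes into a dict; strict about keys."""
    record = dict.fromkeys(FIELDS)          # every field starts as None
    for line in raw.decode(encoding).splitlines():
        if not line.strip():
            continue                        # the service emits a blank line
        key, sep, value = line.partition(':')
        key = key.strip()
        if not sep or key not in record:
            raise ValueError('unexpected line %r' % line)
        record[key] = value.strip()
    missing = [k for k, v in record.items() if v is None]
    if missing:
        raise ValueError('missing keys in record: %s' % ', '.join(missing))
    return record


sample = (b'Country: COLOMBIA (CO)\nCity: Bogot\xe1\n\n'
          b'Latitude: 10.4\nLongitude: -75.2833\nIP: 201.234.178.62\n')
print(parse_record(sample, 'iso-8859-1'))
```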

I've also changed the keys "Lat" and "Long" to "Latitude" and 
"Longitude". If that's a problem, it's easy to fix. Just before 
returning the record, change the key:

record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing 
with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and 
hope it is the right one. UTF-8 is a good choice. (It's the only 
*correct* choice, everybody should be using UTF-8, but alas they often 
don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header 
is missing, and you should be good.
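
A minimal sketch of that fallback, assuming Python 3, where the headers 
object of a urllib response is an email.message.Message subclass. The 
helper name charset_or_default is invented for the example, and plain 
Message objects stand in for a real response so that it runs without a 
network connection.

```python
from email.message import Message


def charset_or_default(headers, default='utf-8'):
    """Return the charset declared in Content-Type, or `default` if absent."""
    # get_content_charset() hands back its argument when the header
    # carries no charset parameter (or is missing altogether).
    return headers.get_content_charset(default)


with_charset = Message()
with_charset['Content-Type'] = 'text/html; charset=ISO-8859-1'
without_charset = Message()
without_charset['Content-Type'] = 'text/html'

print(charset_or_default(with_charset))     # the declared charset, lower-cased
print(charset_or_default(without_charset))  # falls back to the default
```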

How to deal with incorrect encodings? That can happen when the website 
creator *thinks* they are using a certain encoding, but somehow invalid 
bytes for that encoding creep into the data. That gives us a few 
different strategies:

(1) The third-party chardet module can analyse text and try to guess 
what encoding it *actually* is, rather than what encoding it claims to 
be. This is what Firefox and other web browsers do, because there are an 
awful lot of shitty websites out there. But it's not foolproof, so even 
if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch 
the exception and try again with a different encoding:

    for codec in (encoding, 'utf-8', 'latin-1'):
        try:
            key = key.decode(codec)
        except UnicodeDecodeError:
            pass
        else:
            break
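
The same loop can be wrapped in a small helper that also reports which 
codec succeeded. This is only a sketch: the function name 
decode_with_fallbacks is hypothetical, and the default codec list is an 
assumption rather than a recommendation.

```python
def decode_with_fallbacks(data, codecs=('utf-8', 'latin-1')):
    """Return (text, codec) for the first codec that decodes `data`."""
    for codec in codecs:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue  # this codec rejected the bytes; try the next one
    raise ValueError('none of the codecs could decode the data')


for raw in (b'Bogot\xc3\xa1', b'Bogot\xe1'):
    text, codec = decode_with_fallbacks(raw)
    print(codec, ascii(text))
```

The first byte string is valid UTF-8; the second is not, so it falls 
through to the last codec in the list.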

Latin-1 should be last, because it has the nice property that it will 
*always* succeed. That doesn't mean it will give you the right 
characters, as intended by the person who wrote the website, just that 
it will always give you *some* characters. They may be completely wrong, 
in other 

Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano
On Sat, Jan 04, 2014 at 04:15:30PM -0800, Alex Kleider wrote:

 py> 'Bogotá'.encode('utf-8')
 
 I'm interested in knowing how you were able to enter the above line 
 (assuming you have a key board similar to mine.)

I'm running Linux, and I use the KDE or Gnome character selector, 
depending on which computer I'm using. They give you a graphical window 
showing a screenful of characters at a time. Depending on which 
application I'm using, you can search for characters by name or 
property, then copy them to the clipboard and paste them into another 
application.

I can also use the compose key. My keyboard doesn't have an actual key 
labelled compose, but my system is set to use the right-hand Windows key 
(between Alt and the menu key) as the compose key. (Why the left-hand 
Windows key isn't set to do the same thing is a mystery to me.) So if I 
type:

Compose 'a

I get á.

The problem with the compose key is that it's not terribly intuitive. 
Sure, a few of them are: Compose 1 2 gives ½ but how do I get π (pi)? 
Compose p doesn't work.



-- 
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider
A heartfelt thank you to those of you who have given me much to ponder 
with your helpful responses.
In the meantime I've rewritten my procedure using a different approach 
altogether.  I'd be interested in knowing whether you think it's worth 
keeping, or whether you suggest I use your revisions to my original hack.


I've been maintaining both a Python 3 and a Python 2.7 version.  The 
latter has actually opened my eyes to more complexities, specifically 
the need to use unicode strings rather than Python 2.7's default ASCII 
byte strings.


Here it is:
alex@x301:~/Python/Parse$ cat ip_info.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-

import re
import urllib2

url_format_str = \
    u'http://api.hostip.info/get_html.php?ip=%s&position=true'

info_exp = r"""
Country:[ ](?P<country>.*)
[\n]
City:[ ](?P<city>.*)
[\n]
[\n]
Latitude:[ ](?P<lat>.*)
[\n]
Longitude:[ ](?P<lon>.*)
[\n]
IP:[ ](?P<ip>.*)
"""
info_pattern = re.compile(info_exp, re.VERBOSE).search

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')

    info = info_pattern(response.read().decode(encoding))
    return {"Country" : unicode(info.group("country")),
            "City" : unicode(info.group("city")),
            "Lat" : unicode(info.group("lat")),
            "Lon" : unicode(info.group("lon")),
            "IP" : unicode(info.group("ip"))}

if __name__ == "__main__":
    print """    IP address is %(IP)s:
        Country: %(Country)s;  City: %(City)s.
        Lat/Long: %(Lat)s/%(Lon)s""" % ip_info("201.234.178.62")

Apart from soliciting your general comments, I'm also interested to know 
exactly what the line

# -*- coding : utf -8 -*-
really indicates or more importantly, is it true, since I am using vim 
and I assume things are encoded as ascii?


I've discovered that with Ubuntu it's very easy to switch from English 
(US) to English (US, international with dead keys) with just two clicks 
so thanks for that tip as well.






Re: [Tutor] encoding question

2014-01-04 Thread Danny Yoo
Hi Alex,


According to:

http://www.hostip.info/use.html

there is a JSON-based interface.  I'd recommend using that one!  JSON
is a format that's easy for machines to decode.  The format you're
parsing is primarily for humans, and who knows if that will change in
the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself
should be fairly straightforward.  For example:

#
## In Python 2.7
##
>>> import json
>>> import urllib
>>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
>>> info = json.load(response)
>>> info
{u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
u'country_code': u'US', u'ip': u'216.239.45.81'}
#


Best of wishes!


Re: [Tutor] encoding question

2014-01-04 Thread Danny Yoo
You were asking earlier about the line:

# -*- coding : utf -8 -*-

See PEP 263:

http://www.python.org/dev/peps/pep-0263/
http://docs.python.org/release/2.3/whatsnew/section-encodings.html

It's a line that tells Python how to interpret the bytes of your
source program.  It allows us to write unicode literal strings
embedded directly in the program source itself.
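
One detail that may matter here: the Python language reference says the 
declaration comment has to match the regular expression 
coding[=:]\s*([-\w.]+). The line as quoted above has a space before the 
colon and another inside "utf -8", so as far as I can tell it would not 
be recognised. A small sketch to check this (the helper name 
declared_encoding is made up for the example):

```python
import re

# Pattern quoted in the Python language reference for encoding declarations.
CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')


def declared_encoding(line):
    """Return the encoding named in a source comment line, or None."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


print(declared_encoding('# -*- coding: utf-8 -*-'))    # utf-8
print(declared_encoding('# -*- coding : utf -8 -*-'))  # None
```

If that reading is right, writing the line as "# -*- coding: utf-8 -*-" 
is the safer spelling.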


Re: [Tutor] encoding question

2014-01-04 Thread Dave Angel
On Sat, 04 Jan 2014 18:31:13 -0800, Alex Kleider aklei...@sonic.net 
wrote:

 exactly what the line
 # -*- coding : utf -8 -*-
 really indicates or more importantly, is it true, since I am using
 vim and I assume things are encoded as ascii?


I don't know vim specifically, but I'm 99% sure it will let you 
specify the encoding. Certainly emacs does, so I'd not expect vim to 
fall behind on such a fundamental point.  It's also likely that it 
defaults to UTF-8 for new files.  In any case, your job is to make 
sure that the encoding line matches what the editor is using.  Emacs 
also looks in the first few lines for that same encoding line, so if 
you format it carefully, it'll just work.  It's easy to test for 
yourself: just paste some international characters into a literal 
string.


--
DaveA



Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider

On 2014-01-04 18:44, Danny Yoo wrote:

 [...]
 there is a JSON-based interface.  I'd recommend using that one!
 [...]

This strikes me as being the most elegant solution to date, and I thank 
you for it!


The problem is that the city name doesn't come in:

alex@x301:~/Python/Parse$ cat tutor.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-
# file: 'tutor.py'
"""
Put your docstring here.
"""
print "Running 'tutor.py'..."

import json
import urllib
response = urllib.urlopen\
 ('http://api.hostip.info/get_json.php?ip=201.234.178.62&position=true')
info = json.load(response)
print info

alex@x301:~/Python/Parse$ ./tutor.py
Running 'tutor.py'...
{u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', 
u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': 
u'-75.2833'}


If I use my own IP the city comes in fine so there must still be some 
problem with the encoding.

should I be using
encoding = response.headers.getparam('charset')
in there somewhere?



Any ideas?


Re: [Tutor] encoding question

2014-01-04 Thread eryksun
On Sat, Jan 4, 2014 at 11:16 PM, Alex Kleider aklei...@sonic.net wrote:
 {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', u'country_code':
 u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'}

 If I use my own IP the city comes in fine so there must still be some
 problem with the encoding.

Report a bug in their JSON API. It's returning b'"city":null'. I see
the same problem for www.msj.go.cr in San José, Costa Rica. It's
probably broken for all non-ASCII byte strings.
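
To see why this points at the server rather than at the decoding step, 
here is a small Python 3 sketch. The bytes are a hand-written, abridged 
stand-in for the reply, not a capture from the live service.

```python
import json

# Abridged stand-in for the bytes the service sends back.
raw = b'{"country_name": "COLOMBIA", "city": null, "ip": "201.234.178.62"}'

info = json.loads(raw.decode('utf-8'))
print(info['city'] is None)
```

JSON null is turned into Python's None by the json module, so the None 
is already present in the data on the wire; no choice of charset on the 
client side can bring the city name back.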


Re: [Tutor] encoding question

2014-01-04 Thread Danny Yoo
Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.  But I truly dislike XML for being implemented in ways that
are usually not fun to navigate: either the APIs or the encoded data
are usually convoluted enough to make it a chore rather than a
pleasure.

The beginning does look similar:

##
>>> import xml.etree.ElementTree as ET
>>> import urllib
>>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
>>> tree = ET.parse(response)
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x185a2d0>
##


Up to this point, not so bad.  But this is where it starts to look silly:

##
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
'201.234.178.62'
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
u'Bogot\xe1'
##

where we need to deal with XML namespaces, an extra complexity for a
benefit that I have never bought into.
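
For what it's worth, ElementTree's find() also accepts a mapping from 
prefix to namespace URI, which shortens the paths somewhat. A minimal 
Python 3 sketch follows; SAMPLE is a cut-down, hand-written stand-in 
for the service's reply, and the NS name is just a label chosen for the 
example.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the service's reply; not the full document.
SAMPLE = b'''<?xml version="1.0" encoding="ISO-8859-1" ?>
<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">
 <gml:featureMember>
  <Hostip>
   <ip>201.234.178.62</ip>
   <gml:name>Bogot\xe1</gml:name>
  </Hostip>
 </gml:featureMember>
</HostipLookupResultSet>
'''

NS = {'gml': 'http://www.opengis.net/gml'}  # prefix -> namespace URI

root = ET.fromstring(SAMPLE)
city = root.find('gml:featureMember/Hostip/gml:name', NS).text
ip = root.find('gml:featureMember/Hostip/ip', NS).text
print(ascii(city), ip)
```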


More than that, usually the XML I run into in practice isn't even
properly structured, as is the case with the lat-long value in the XML
output here:

##
>>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
'-75.2833,10.4'
##

which is truly silly.  Why are the latitude and longitude not two
separate, structured values?  What is this XML buying us here, really,
then?  I'm convinced that all the extraneous structure and complexity
in XML causes the people who work with it to stop caring, the result
being something that isn't for the benefit of either humans or
computer programs.
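
Splitting the combined value by hand is at least short. A sketch, 
assuming the text is always "lng,lat" in that order (which is what the 
comment in the service's XML output says); the function name 
split_coordinates is made up for the example.

```python
def split_coordinates(text):
    """Turn 'lng,lat' text into a (lat, lng) tuple of floats."""
    lng, lat = (float(part) for part in text.split(','))
    return lat, lng


print(split_coordinates('-75.2833,10.4'))
```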


Hence, that's why I prefer JSON: JSON export is usually a lot more
sensible, for reasons that I can speculate on, but I probably should
stop this rant.  :P


Re: [Tutor] encoding question

2014-01-04 Thread Danny Yoo
 then?  I'm convinced that all the extraneous structure and complexity
 in XML causes the people who work with it to stop caring, the result
 being something that isn't for the benefit of either humans nor
 computer programs.


... I'm sorry.  Sometimes I get grumpy when I haven't had a Snickers.

I should not have said the above here.  It isn't factual, and worse,
it insinuates an uncharitable intent to people who I do not know.
There's enough insinuation and insults out there in the world already:
I should not be contributing to those things.  For that, I apologize.


Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider

On 2014-01-04 21:20, Danny Yoo wrote:

[...]

Hence, that's why I prefer JSON: JSON export is usually a lot more
sensible, for reasons that I can speculate on, but I probably should
stop this rant.  :P


Not a rant at all.

As it turns out, one of the other things that have interested me of late 
is docbook, an xml dialect (I think this is the correct way to express 
it.)  I've found it very useful and so do not share your distaste for 
xml although one can't disagree with the points you've made with regard 
to xml as a solution to the problem under discussion.
I've not played with the python xml interfaces before so this will be a 
good project for me.


Thanks.


Re: [Tutor] encoding question

2014-01-04 Thread Alex Kleider

On 2014-01-04 21:20, Danny Yoo wrote:

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.


Well, I've tried the xml approach which seems promising but still I get 
an encoding related error.
Is there a bug in the xml.etree module (not very likely, me thinks) or 
am I doing something wrong?
There's no denying that the whole encoding issue is still not completely 
clear to me in spite of having devoted a lot of time to trying to grasp 
all that's involved.


Here's what I've got:

alex@x301:~/Python/Parse$ cat ip_xml.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-
# file: 'ip_xml.py'

import urllib2
import xml.etree.ElementTree as ET


url_format_str = \
    u'http://api.hostip.info/?ip=%s&position=true'

def ip_info(ip_address):
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')
    print "'encoding' is '%s'." % (encoding, )
    info = unicode(response.read().decode(encoding))
    n = info.find('\n')
    print "location of first newline is %s." % (n, )
    xml = info[n+1:]
    print "'xml' is '%s'." % (xml, )

    tree = ET.fromstring(xml)
    root = tree.getroot()   # Here's where it blows up!!!
    print "'root' is '%s', with the following children:" % (root, )
    for child in root:
        print child.tag, child.attrib
    print "END of CHILDREN"
    return info

if __name__ == "__main__":
    info = ip_info("201.234.178.62")

alex@x301:~/Python/Parse$ ./ip_xml.py
'encoding' is 'iso-8859-1'.
location of first newline is 44.
'xml' is '<HostipLookupResultSet version="1.0.1" 
xmlns:gml="http://www.opengis.net/gml" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd">
 <gml:description>This is the Hostip Lookup Service</gml:description>
 <gml:name>hostip</gml:name>
 <gml:boundedBy>
  <gml:Null>inapplicable</gml:Null>
 </gml:boundedBy>
 <gml:featureMember>
  <Hostip>
   <ip>201.234.178.62</ip>
   <gml:name>Bogotá</gml:name>
   <countryName>COLOMBIA</countryName>
   <countryAbbrev>CO</countryAbbrev>
   <!-- Co-ordinates are available as lng,lat -->
   <ipLocation>
    <gml:pointProperty>
     <gml:Point srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
      <gml:coordinates>-75.2833,10.4</gml:coordinates>
     </gml:Point>
    </gml:pointProperty>
   </ipLocation>
  </Hostip>
 </gml:featureMember>
</HostipLookupResultSet>
'.
Traceback (most recent call last):
  File "./ip_xml.py", line 33, in <module>
    info = ip_info("201.234.178.62")
  File "./ip_xml.py", line 23, in ip_info
    tree = ET.fromstring(xml)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in 
position 456: ordinal not in range(128)
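
One thing that may be worth trying, offered as a sketch rather than a 
diagnosis: hand the parser the undecoded bytes, with the XML 
declaration line still attached, and let it work out the encoding 
itself. The traceback suggests that the already-decoded unicode string 
is being re-encoded as ASCII on its way into the parser. The example 
below is Python 3 and uses a tiny hand-written document in place of the 
real response.

```python
import xml.etree.ElementTree as ET

# Stand-in for response.read(): undecoded bytes, XML declaration still attached.
raw = (b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
       b'<Hostip><name>Bogot\xe1</name></Hostip>')

root = ET.fromstring(raw)      # the parser reads the declared encoding itself
print(ascii(root.find('name').text))
```

Note also that ET.fromstring() returns the root element directly, so 
there is no getroot() step afterwards; getroot() belongs to the 
ElementTree object that ET.parse() returns.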






Re: [Tutor] Encoding error when reading text files in Python 3

2012-07-28 Thread Steven D'Aprano

Dat Huynh wrote:

Dear all,

I have written a simple application by Python to read data from text files.

Current I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
I don't know why it does not run on Python version 3 while it runs
well on Python 2.


Python 2 is more forgiving of beginner errors when dealing with text and 
bytes, but makes it harder to deal with text correctly.


Python 3 makes it easier to deal with text correctly, but is less forgiving.

When you read from a file in Python 2, it will give you *something*, even if 
it is the wrong thing. It will not give a decoding error, even if the text 
you are reading is not valid text. It will just give you junk bytes, sometimes 
known as mojibake.


Python 3 no longer does that. It tells you when there is a problem, so you can 
fix it.




Could you please tell me how I can run it on python 3?
Following is my Python code.

 --
 for subdir, dirs, files in os.walk(rootdir):
     for file in files:
         print("Processing [" + file + "]...\n")
         f = open(rootdir+file, 'r')
         data = f.read()
         f.close()
         print(data)
 --

This is the error message:

[...]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position
4980: ordinal not in range(128)



This tells you that you are reading a non-ASCII file but haven't told Python 
what encoding to use, so Python falls back on its default encoding, which on 
your system is evidently ASCII.


Do you know what encoding the file is?

Do you understand about Unicode text and bytes? If not, I suggest you read 
this article:


http://www.joelonsoftware.com/articles/Unicode.html


In Python 3, you can either tell Python what encoding to use:

f = open(rootdir+file, 'r', encoding='utf8')  # for example

or you can set an error handler:

f = open(rootdir+file, 'r', errors='ignore')  # for example

or both

f = open(rootdir+file, 'r', encoding='ascii', errors='replace')
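
To make the difference between these options concrete, here is a small 
Python 3 sketch that writes a throwaway file and reads it back three 
ways. The file contents and variable names are invented for the 
example.

```python
import os
import tempfile

# Write a few bytes that are valid Latin-1 but not valid UTF-8.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'Bogot\xe1\n')

with open(path, 'r', encoding='latin-1') as f:
    right = f.read()            # the right codec recovers the text

with open(path, 'r', encoding='utf-8', errors='replace') as f:
    replaced = f.read()         # the bad byte becomes U+FFFD

with open(path, 'r', encoding='utf-8', errors='ignore') as f:
    ignored = f.read()          # the bad byte is silently dropped

os.remove(path)
print(ascii(right), ascii(replaced), ascii(ignored))
```

Only the first read gets the original text back; the error handlers 
avoid the exception at the cost of losing the character.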


You can see the list of encodings and error handlers here:

http://docs.python.org/py3k/library/codecs.html


Unfortunately, Python 2 does not support this using the built-in open 
function. Instead, you have to use codecs.open instead of the built-in open, 
like this:


import codecs
f = codecs.open(rootdir+file, 'r', encoding='utf8')  # for example

which fortunately works in both Python 2 and 3.


Or you can read the file in binary mode, and then decode it into text:

f = open(rootdir+file, 'rb')
data = f.read()
f.close()
text = data.decode('cp866', 'replace')
print(text)


If you don't know the encoding, you can try opening the file in Firefox or 
Internet Explorer and see if they can guess it, or you can use the chardet 
library in Python.


http://pypi.python.org/pypi/chardet

Or if you don't care about getting mojibake, you can pretend that the file is 
encoded using Latin-1. That will pretty much read anything, although what it 
gives you may be junk.
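
A short Python 3 sketch of what that junk looks like, using a made-up 
sample string:

```python
text = 'Bogot\xe1'                      # 'Bogotá'
utf8_bytes = text.encode('utf-8')

# Wrong codec: no exception, but the accented letter turns into two characters.
garbled = utf8_bytes.decode('latin-1')
print(ascii(text), ascii(garbled))

# Latin-1 maps every byte value to a character, so decoding cannot fail,
# and re-encoding gives back the original bytes.
every_byte = bytes(range(256))
assert every_byte.decode('latin-1').encode('latin-1') == every_byte
```

Because the round trip is lossless, text read this way can still be 
repaired later by encoding it back to Latin-1 and decoding with the 
right codec, once the right codec is known.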




--
Steven


Re: [Tutor] Encoding error when reading text files in Python 3

2012-07-28 Thread Dat Huynh
I changed my code and it runs on Python 3 now.

    f = open(rootdir+file, 'rb')
    data = f.read().decode('utf8', 'ignore')

Thank you very much.
Sincerely,
Dat.






Re: [Tutor] Encoding

2011-11-18 Thread Max S.
Well, I am assuming that by this you mean converting user input into a
string, and then extracting the numerals (0-9) from it.  Next time, please
tell us your version of Python.  I'll do my best to help with this.  You
might try the following:

    the_input = input("Insert string here: ") # change to raw_input in python 2
    after = ""
    for char in the_input:
        try:
            char = int(char)      # digits convert cleanly and are dropped
        except ValueError:
            after += char         # anything else is kept

If other symbols might be in the string ($, @, etc.), then you might use

    the_input = input('Insert string here: ') # change to raw_input in python 2
    after = ''
    not_allowed = '1234567890-=!@#$%^&*()_+,./?`~[]{}\\|'
    for char in the_input:
        if char in not_allowed:
            pass
        else:
            after += char

This method requires more typing, but it works with a wider variety of
characters.  Hopefully this helped.

On Thu, Nov 17, 2011 at 8:45 PM, Nidian Job-Smith nidia...@hotmail.com wrote:


 Hi all,

 In my programme I am encoding what the user has inputted.

 What the user inputs will be a string, which might be a mixture of letters
 and numbers.

 However I only want the letters to be encoded.


 Does anyone know how I can only allow the characters to be encoded?

 Big thanks,





Re: [Tutor] Encoding

2011-11-18 Thread Prasad, Ramit
On 11/17/2011 8:45 PM, Nidian Job-Smith wrote: 

Hi all, 

In my programme I am encoding what the user has inputted. 

What the user inputs will be a string, which might be a mixture of letters and 
numbers.

However I only want the letters to be encoded. 


I am assuming that you meant only accept characters and not actual
text encoding. The following example is untested and is limited. It 
will not really work with non-ASCII letters (i.e. Unicode).

    import string
    input_string = raw_input( 'Enter something' ) #use input in Python3
    final_input = [] # append to a list instead of concatenating a string
                     # because it is faster to ''.join( list )
    for char in input_string:
        # string.letters exists only in Python 2; ascii_letters is in both
        if char in string.ascii_letters:
            final_input.append( char )
    input_string = ''.join( final_input )
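
If non-ASCII letters do need to survive, one option in Python 3 is to 
test each character with str.isalpha(), which follows the Unicode 
definition of a letter. A minimal sketch; the function name 
letters_only is made up for the example.

```python
def letters_only(text):
    """Keep only the characters that Unicode classifies as letters."""
    return ''.join(char for char in text if char.isalpha())


print(letters_only('abc123 def!'))
```

This also drops spaces and punctuation, which may or may not be what 
the original poster wants.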



Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423



Re: [Tutor] Encoding

2011-11-17 Thread bob gailer

On 11/17/2011 8:45 PM, Nidian Job-Smith wrote:


Hi all,

In my programme I am encoding what the user has inputted.

What the user inputs will be a string, which might be a mixture of 
letters and numbers.


However I only want the letters to be encoded.


Does anyone know how I can only allow the characters to be encoded?


Your question makes no sense to me. Please explain what you mean by 
encoding letters?


An example of input and output might also help.

Be sure to reply-all.

--
Bob Gailer
919-636-4239
Chapel Hill NC



Re: [Tutor] Encoding

2010-03-07 Thread Giorgio
2010/3/7 spir denis.s...@gmail.com


  Oh, right. And, if i'm not wrong B is an UTF8 string decoded to unicode
  (due to the coding: statement at the top of the file) and re-encoded to
  latin1

 Si! :-)


Ahah. Ok, Grazie!

One more question: Amazon SimpleDB only accepts UTF8.

So, let's say i have to put into an image file:

filestream = file.read()
filetoput = filestream.encode('utf-8')

Do you think this is ok?

Oh, of course everything url-encoded then

Giorgio





-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-07 Thread spir
On Sun, 7 Mar 2010 13:23:12 +0100
Giorgio anothernetfel...@gmail.com wrote:

 One more question: Amazon SimpleDB only accepts UTF8.
[...]
 filestream = file.read()
 filetoput = filestream.encode('utf-8')

No! What is the content of the file? Do you think it can be a pure python 
representation of a unicode text?

uContent = inFile.read().decode(***format***)
process, if any
outFile.write(uContent.encode('utf-8'))

input --decode--> process --encode--> output

This gives me an idea: when working with unicode, it would be cool to have an 
optional format parameter for file.read() and write. So, the above would be:

uContent = inFile.read(***format***)
process, if any
outFile.write(uContent, 'utf-8')

Or, maybe even better, the format could be given as third parameter of file 
open(); then any read or write operation would directly convert from/to the 
said format. What do you all think?
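
For what it's worth, io.open (available from Python 2.6, and the same 
function as the built-in open in Python 3) already takes an encoding 
argument, so the decode, process, encode pipeline can be written 
without explicit decode() and encode() calls. A sketch under those 
assumptions; the function name transcode and the file names are 
invented for the example.

```python
import io
import os
import tempfile


def transcode(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    """Decode src_path, optionally process the text, re-encode to dst_path."""
    with io.open(src_path, 'r', encoding=src_encoding) as src:
        text = src.read()              # bytes -> text happens here
    # ... any processing on `text` would go here ...
    with io.open(dst_path, 'w', encoding=dst_encoding) as dst:
        dst.write(text)                # text -> bytes happens here


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'in.txt')
dst = os.path.join(workdir, 'out.txt')
with open(src, 'wb') as f:
    f.write(b'Bogot\xe1')              # Latin-1 bytes

transcode(src, dst, 'latin-1')
with open(dst, 'rb') as f:
    result = f.read()
print(result)                          # the same text as UTF-8 bytes
```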


denis
-- 


la vita e estrany

spir.wikidot.com



Re: [Tutor] Encoding

2010-03-07 Thread python
 Or, maybe even better, the format could be given as third parameter of file 
 open(); then any read or write operation would directly convert from/to the 
 said format. What do you all think?

See the codecs.open() command as an alternative to open().

With all the hassles of encoding, I'm puzzled why anyone would use the
regular open() for anything but binary operations.

Malcolm





Re: [Tutor] Encoding

2010-03-07 Thread Dave Angel

Giorgio wrote:

2010/3/7 spir denis.s...@gmail.com

  
One more question: Amazon SimpleDB only accepts UTF8.


So, let's say i have to put into an image file:

  
Do you mean a binary file with image data, such as a jpeg?  In that 
case, an emphatic - NO.  not even close.

filestream = file.read()
filetoput = filestream.encode('utf-8')

Do you think this is ok?

Oh, of course everything url-encoded then

Giorgio


  
Encoding binary data with utf-8 wouldn't make any sense, even if you did 
have the right semantics for a text file. 

Next problem, 'file' is the name of a built-in type, not a keyword.  So if 
you write what you describe, you're calling an instance method on the class 
object itself, which will raise an error.



Those two lines don't make any sense by themselves.  Show us some 
context, and we can more sensibly comment on them.  And try not to use 
names that shadow built-ins or Python stdlib names.


If you're trying to store binary data in a repository that only permits 
text, it's not enough to pretend to convert it to UTF-8.  You need to do 
some other escaping, such as UUENCODE, that transforms the binary data 
into something resembling text.  Then you may or may not need to encode 
that text with utf-8, depending on its character set.



DaveA



Re: [Tutor] Encoding

2010-03-07 Thread Giorgio
2010/3/7 Dave Angel da...@ieee.org


 Those two lines don't make any sense by themselves.  Show us some context,
 and we can more sensibly comment on them.  And try not to use names that
 hide built-in keywords, or Python stdlib names.


Hi Dave,

I'm considering Amazon SimpleDB as an alternative to PGSQL, but i need to
store blobs.

Amazon's FAQs says that:

Q: What kind of data can I store?
You can store any UTF-8 string data in Amazon SimpleDB. Please refer
to the Amazon
Web Services Customer Agreement http://aws.amazon.com/agreement for
details.

This is the problem. Any idea?


 DaveA


Giorgio



-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-07 Thread Dave Angel

Giorgio wrote:

2010/3/7 Dave Angel da...@ieee.org

  

Those two lines don't make any sense by themselves.  Show us some context,
and we can more sensibly comment on them.  And try not to use names that
hide built-in keywords, or Python stdlib names.




Hi Dave,

I'm considering Amazon SimpleDB as an alternative to PGSQL, but i need to
store blobs.

Amazon's FAQs says that:

Q: What kind of data can I store?
You can store any UTF-8 string data in Amazon SimpleDB. Please refer
to the Amazon
Web Services Customer Agreement http://aws.amazon.com/agreement for
details.

This is the problem. Any idea?


  

DaveA




Giorgio



  
You still didn't provide the full context.  Are you trying to store 
binary data, or not?


Assuming you are, you could do the UUENCODE suggestion I made.  Or use 
base64:


base64.encodestring(s) will turn binary data into (larger) binary 
data, also considered a string.  The latter is ASCII, so it's irrelevant 
whether it's considered utf-8 or otherwise.  You store the resulting 
string in your database, and use base64.decodestring(s) to reconstruct 
your original.


There's 50 other ways, some more efficient, but this may be the simplest.
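
A minimal sketch of the base64 round trip, written for Python 3. It uses base64.b64encode() and base64.b64decode() rather than the encodestring()/decodestring() pair named above (my substitution, since those two names are not available in current Python 3 releases). The `blob` value is made-up sample data standing in for an image file's contents.

```python
import base64

# Hypothetical binary payload: every possible byte value once.
blob = bytes(bytearray(range(256)))

# Binary -> ASCII-only text that a UTF-8-only store can hold.
stored = base64.b64encode(blob).decode("ascii")

# Text from the store -> the original binary data.
restored = base64.b64decode(stored)

print(len(blob), len(stored))
print(restored == blob)
```

Because the encoded form is pure ASCII, it is also valid UTF-8 without any further conversion.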

DaveA




Re: [Tutor] Encoding

2010-03-07 Thread Stefan Behnel

Giorgio, 05.03.2010 14:56:

What i don't understand is why:

s = u"ciao è ciao" is converting a string to unicode, decoding it from the
specified encoding but

t = "ciao è ciao"
t = unicode(t)

That should do exactly the same instead of using the specified encoding
always assume that if i'm not telling the function what the encoding is, i'm
using ASCII.

Is this a bug?


Did you read the Unicode tutorial at the link I posted? Here's the link again:

http://www.amk.ca/python/howto/unicode

Stefan



Re: [Tutor] Encoding

2010-03-06 Thread Giorgio
2010/3/5 Dave Angel da...@ieee.org

I'm not angry, and I'm sorry if I seemed angry.  Tone of voice is hard to
 convey in a text message.


Ok, sorry. I've misunderstood your mail :D


 I'm still not sure whether your confusion is to what the rules are, or why
 the rules were made that way.


WHY the rules are made that way. But now it's clear.

2010/3/6 Mark Tolonen metolone+gm...@gmail.com



  Maybe this will help:

   # coding: utf-8

   a = "ciao è ciao"
   b = u"ciao è ciao".encode('latin-1')

 a is a UTF-8 string, due to #coding line in source.
 b is a latin-1 string, due to explicit encoding.


Oh, right. And, if i'm not wrong B is an UTF8 string decoded to unicode (due
to the coding: statement at the top of the file) and re-encoded to latin1


 -Mark


Thankyou again

Giorgio




-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-05 Thread Giorgio


 Ok,so you confirm that:

 s = u"ciao è ciao" will use the file specified encoding, and that

 t = "ciao è ciao"
 t = unicode(t)

 Will use, if not specified in the function, ASCII. It will ignore the
 encoding I specified on the top of the file. right?



 A literal  u string, and only such a (unicode) literal string, is
 affected by the encoding specification.  Once some bytes have been stored in
 a 8 bit string, the system does *not* keep track of where they came from,
 and any conversions then (even if they're on an adjacent line) will use the
 default decoder.  This is a logical example of what somebody said earlier on
 the thread -- decode any data to unicode as early as possible, and deal only
 with unicode strings in the program.  Then, if necessary, encode them into
 whatever output form immediately before (or while) outputting them.



 Ok Dave, What i don't understand is why:

s = u"ciao è ciao" is converting a string to unicode, decoding it from the
specified encoding but

t = "ciao è ciao"
t = unicode(t)

That should do exactly the same instead of using the specified encoding
always assume that if i'm not telling the function what the encoding is, i'm
using ASCII.

Is this a bug?

Giorgio
-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-05 Thread Dave Angel

Giorgio wrote:


Ok,so you confirm that:

s = u"ciao è ciao" will use the file specified encoding, and that

t = "ciao è ciao"
t = unicode(t)

Will use, if not specified in the function, ASCII. It will ignore the
encoding I specified on the top of the file. right?



  

A literal  u string, and only such a (unicode) literal string, is
affected by the encoding specification.  Once some bytes have been stored in
a 8 bit string, the system does *not* keep track of where they came from,
and any conversions then (even if they're on an adjacent line) will use the
default decoder.  This is a logical example of what somebody said earlier on
the thread -- decode any data to unicode as early as possible, and deal only
with unicode strings in the program.  Then, if necessary, encode them into
whatever output form immediately before (or while) outputting them.





 Ok Dave, What i don't understand is why:

s = u"ciao è ciao" is converting a string to unicode, decoding it from the
specified encoding but

t = "ciao è ciao"
t = unicode(t)

That should do exactly the same instead of using the specified encoding
always assume that if i'm not telling the function what the encoding is, i'm
using ASCII.

Is this a bug?

Giorgio
  
In other words, you don't understand my paragraph above.  Once the 
string is stored in t as an 8 bit string, it's irrelevant what the 
source file encoding was.  If you then (whether it's in the next line, 
or ten thousand calls later) try to convert to unicode without 
specifying a decoder, it uses the default encoding, which is an 
application-wide setting, and not a source file setting.  To see what it is 
on your system, use sys.getdefaultencoding().


There's an encoding specified or implied for each source file of an 
application, and they need not be the same.  It affects string literals 
that come from that particular file. It does not affect any other 
conversions, as far as I know.  For that matter, many of those source 
files may not even exist any more by the time the application is run.


There are also encodings attached to each file object, I believe, though 
I've got no experience with that.  So sys.stdout would have an encoding 
defined, and any unicode strings passed to it would be converted using 
that specification.
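
A small sketch for inspecting those two settings, written for Python 3 (where the reported default will generally differ from the 'ascii' discussed in this thread). The variable names are mine. Whatever the values turn out to be, the last lines show the approach recommended here: name the encoding explicitly instead of leaning on a default.

```python
import sys

# Application-wide default used for implicit conversions.
default_encoding = sys.getdefaultencoding()

# Encoding attached to the stdout file object; it may be missing or None
# when stdout has been replaced by something that is not a text stream.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("default encoding:", default_encoding)
print("stdout encoding: ", stdout_encoding)

# Explicit conversion: no default is consulted at all.
data = b"ciao \xc3\xa8 ciao"
text = data.decode("utf-8")
print(repr(text))
```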


The point is that there isn't just one global value, and it's a good 
thing.  You should figure everywhere characters come into  your program 
(eg. source files, raw_input, file i/o...) and everywhere characters go 
out of your program, and deal with each of them individually.  Don't 
store anything internally as strings, and you won't create the ambiguity 
you have with your 't' variable above.


DaveA


Re: [Tutor] Encoding

2010-03-05 Thread Giorgio
2010/3/5 Dave Angel da...@ieee.org

 In other words, you don't understand my paragraph above.


Maybe. But please don't be angry. I'm here to learn, and as i've run into a
very difficult concept I want to fully understand it.


 Once the string is stored in t as an 8 bit string, it's irrelevant what the
 source file encoding was.


Ok, you've said this 2 times, but, please, can you tell me why? I think
that's the key passage to understand how encoding of strings works. The
source file encoding affects all file lines, also strings. If my encoding is
UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?

I think the problem is that i can't find any difference between 2 lines
quoted above:

a = u"ciao è ciao"

and

a = "ciao è ciao"
a = unicode(a)


 If you then (whether it's in the next line, or ten thousand calls later)
 try to convert to unicode without specifying a decoder, it uses the default
 encoder, which is a application wide thing, and not a source file thing.  To
 see what it is on your system, use sys.getdefaultencoding().


And this is ok. Spir said that it uses ASCII, you now say that it uses the
default encoder. So I think ASCII is the default encoder on spir's
system.


 The point is that there isn't just one global value, and it's a good thing.
  You should figure everywhere characters come into  your program (eg. source
 files, raw_input, file i/o...) and everywhere characters go out of your
 program, and deal with each of them individually.


Ok. But it always happen this way. I hardly ever have to work with strings
defined in the file.


 Don't store anything internally as strings, and you won't create the
 ambiguity you have with your 't' variable above.

 DaveA


Thankyou Dave

Giorgio



-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-05 Thread Dave Angel

Giorgio wrote:

2010/3/5 Dave Angel da...@ieee.org
  

In other words, you don't understand my paragraph above.




Maybe. But please don't be angry. I'm here to learn, and as i've run into a
very difficult concept I want to fully understand it.


  
I'm not angry, and I'm sorry if I seemed angry.  Tone of voice is hard 
to convey in a text message.

Once the string is stored in t as an 8 bit string, it's irrelevant what the
source file encoding was.




Ok, you've said this 2 times, but, please, can you tell me why? I think
that's the key passage to understand how encoding of strings works. The
source file encoding affects all file lines, also strings.

Nope, not strings.  It only affects string literals.

 If my encoding is
UTF8 python will read the string "ciao è ciao" as 'ciao \xc3\xa8 ciao' but
if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant?

I think the problem is that i can't find any difference between 2 lines
quoted above:

s = u"ciao è ciao"

and

t = "ciao è ciao"
c = unicode(t)

[**  I took the liberty of making the variable names different so I can refer 
to them **]
  
I'm still not sure whether your confusion is to what the rules are, or 
why the rules were made that way.  The rules are that an unqualified 
conversion, such as the unicode() function with no second argument, uses 
the default encoding, in strict mode.  Thus the error.


Quoting the help: 
If no optional parameters are given, unicode() will mimic the behaviour 
of str() except that it returns Unicode strings instead of 8-bit 
strings. More precisely, if object is a Unicode string or subclass it 
will return that Unicode string without any additional decoding applied.


For objects which provide a __unicode__() method, it will call 
this method without arguments to create a Unicode string. For all other 
objects, the 8-bit string version or representation is requested and 
then converted to a Unicode string using the codec for the default 
encoding in 'strict' mode.



As for why the rules are that, I'd have to ask you what you'd prefer.  
The unicode() function has no idea that t was created from a literal 
(and no idea what source file that literal was in), so it has to pick 
some coding, called the default coding.  The designers decided to use a 
default encoding of ASCII, because manipulating ASCII strings is always 
safe, while many functions won't behave as expected when given UTF-8 
encoded strings.  For example, what's the 7th character of t?  That is 
not necessarily the same as the 7th character of s, since one or more of 
the characters in between might have taken up multiple bytes in t.  That 
is the case for your accented character when the source is UTF-8 (though not in latin-1), and would be 
for some other European symbols, and certainly for other languages as well.
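
A short sketch of that indexing point, written for Python 3 (an assumption: `str` stands in for the unicode string s, and `bytes` for the 8-bit string t). It assumes the bytes are UTF-8; with latin-1 the two lengths would match.

```python
s = "ciao \u00e8 ciao"          # text: one item per character
t = s.encode("utf-8")           # bytes: the accented letter takes two items

print(len(s), len(t))           # character count vs byte count

# The 7th item (index 6) of each is not the same thing.
seventh_char = s[6]             # the space after the accented letter
seventh_byte = t[6:7]           # second byte of the accented letter

print(repr(seventh_char))
print(repr(seventh_byte))
```

Slicing the byte string at that position would cut the accented letter in half, which is why character-level work is safer on decoded text.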

If you then (whether it's in the next line, or ten thousand calls later)
try to convert to unicode without specifying a decoder, it uses the default
encoder, which is a application wide thing, and not a source file thing.  To
see what it is on your system, use sys.getdefaultencoding().




And this is ok. Spir said that it uses ASCII, you now say that it uses the
default encoder. I think that ASCII on spir's system is the default encoder
so.


  
I don't know, but I think it's the default in every country, at least on 
version 2.6.  It might make sense to get some value from the OS that 
defined the locally preferred encoding, but then a program that worked 
fine in one locale might fail miserably in another.

The point is that there isn't just one global value, and it's a good thing.
 You should figure everywhere characters come into  your program (eg. source
files, raw_input, file i/o...) and everywhere characters go out of your
program, and deal with each of them individually.




Ok. But it always happen this way. I hardly ever have to work with strings
defined in the file.

  
Not sure what you mean by the file.  If you mean the source file, 
that's what your examples are about.   If you mean a data file, that's 
dealt with differently.
  

Don't store anything internally as strings, and you won't create the
ambiguity you have with your 't' variable above.

DaveA




Thankyou Dave

Giorgio



  




Re: [Tutor] Encoding

2010-03-05 Thread Mark Tolonen


Giorgio anothernetfel...@gmail.com wrote in message 
news:23ce85921003050915p1a084c0co73d973282d8fb...@mail.gmail.com...

2010/3/5 Dave Angel da...@ieee.org

I think the problem is that i can't find any difference between 2 lines
quoted above:

a = u"ciao è ciao"

and

a = "ciao è ciao"
a = unicode(a)


Maybe this will help:

   # coding: utf-8

   a = "ciao è ciao"
   b = u"ciao è ciao".encode('latin-1')

a is a UTF-8 string, due to #coding line in source.
b is a latin-1 string, due to explicit encoding.

   a = unicode(a)
   b = unicode(b)

Now what will happen?  unicode() uses 'ascii' if not specified, because it 
has no idea of the encoding of a or b.  Only the programmer knows.  It does 
not use the #coding line to decide.


#coding is *only* used to specify the encoding the source file is saved in, 
so when Python executes the script, reads the source and parses a literal 
Unicode string (u'...', u"...", etc.) the bytes read from the file are 
decoded using the #coding specified.


If Python parses a byte string ('...', "...", etc.) the bytes read from the 
file are stored directly in the string.  The coding line isn't even used. 
The bytes will be exactly what was saved in the file between the quotes.
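
A sketch of why only the programmer can know the encoding, written for Python 3 (an assumption: `bytes` stands in for the byte strings a and b above). The same text yields different bytes under the two encodings, and the bytes carry no label saying which one was used.

```python
text = "ciao \u00e8 ciao"

a = text.encode("utf-8")      # like a: bytes as saved in a UTF-8 source file
b = text.encode("latin-1")    # like b: explicitly encoded to latin-1

print(repr(a))
print(repr(b))

# Decoding with the matching codec recovers the same text from both.
same_text = a.decode("utf-8") == b.decode("latin-1")

# Decoding with the wrong codec either gives garbage silently...
wrong = a.decode("latin-1")

# ...or fails outright.
try:
    b.decode("utf-8")
    b_error = None
except UnicodeDecodeError as exc:
    b_error = exc

print(same_text, repr(wrong), type(b_error).__name__)
```

The silent case is the more dangerous one, since nothing signals that the text is now wrong.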


-Mark




Re: [Tutor] Encoding

2010-03-04 Thread Albert-Jan Roskam
Hi,
 
For everybody who's having trouble understanding encoding, I found this page 
useful:
http://evanjones.ca/python-utf8.html

Cheers!!
Albert-Jan

~~
In the face of ambiguity, refuse the temptation to guess.
~~

--- On Thu, 3/4/10, spir denis.s...@gmail.com wrote:


From: spir denis.s...@gmail.com
Subject: Re: [Tutor] Encoding
To: tutor@python.org
Date: Thursday, March 4, 2010, 8:01 AM


On Wed, 3 Mar 2010 20:44:51 +0100
Giorgio anothernetfel...@gmail.com wrote:

 Please let me post the third update O_o. You can forgot other 2, i'll put
 them into this email.
 
 ---
 >>> s = "ciao è ciao"
 >>> print s
 ciao è ciao
 >>> s.encode('utf-8')
 
 Traceback (most recent call last):
   File "<pyshell#2>", line 1, in <module>
     s.encode('utf-8')
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5:
 ordinal not in range(128)
 ---
 
 I am getting more and more confused.

What you enter on the terminal prompt is text, encoded in a format (ascii, 
latin*, utf*,...) that probably depends on your system locale. As this format 
is always a sequence of bytes, python stores it as a plain str:
>>> s = "ciao è ciao"
>>> s,type(s)
('ciao \xc3\xa8 ciao', <type 'str'>)
My system is configured for utf8. c3-a8 is the repr of 'è' in utf8. It needs 2 
bytes because of the rules of utf8 itself. Right?

To get a python unicode string, it must be decoded from its format, for me utf8:
>>> u = s.decode("utf8")
>>> u,type(u)
(u'ciao \xe8 ciao', <type 'unicode'>)
e8 is the unicode code for 'è' (decimal 232). You can check that in tables. It 
needs here one byte only because 232 < 255.
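
The numbers above can be checked directly. This is a sketch for Python 3 (an assumption: its `str` corresponds to the unicode type used here); the euro sign is an extra example of mine, added to show a character that needs three bytes.

```python
char = "\u00e8"                       # the accented letter from the example

code_point = ord(char)                # its Unicode code, independent of any encoding
utf8_bytes = char.encode("utf-8")     # two bytes under the rules of UTF-8
latin1_bytes = char.encode("latin-1") # one byte in latin-1

print(code_point, hex(code_point))
print(repr(utf8_bytes), repr(latin1_bytes))

# A character further up the code space takes more bytes in UTF-8.
euro_bytes = "\u20ac".encode("utf-8")
print(repr(euro_bytes))
```

The code point is a property of the character; the byte sequence is a property of the chosen encoding.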

[comparison with php]

 Ok, now, the point is: you (and the manual) said that this line:
 
 s = u"giorgio è giorgio"
 
 will convert the string as unicode.

Yes and no: it will convert it *into* a unicode string, in the sense of a 
python representation for universal text. When seeing u"..." , python will 
automagically *decode* the part in "...", taking as source format the one you 
indicate in a pseudo-comment on top of your code file, eg:
# coding: utf8
Else I guess the default is the system's locale format? Or ascii? Does anyone know?
So, in my case u"giorgio è giorgio" is equivalent to "giorgio è 
giorgio".decode("utf8"):
>>> u1 = u"giorgio è giorgio"
>>> u2 = "giorgio è giorgio".decode("utf8")
>>> u1,u2
(u'giorgio \xe8 giorgio', u'giorgio \xe8 giorgio')
>>> u1 == u2
True

 But also said that the part between ""
 will be encoded with my editor BEFORE getting encoded in unicode by python.

"will be encoded with my editor BEFORE getting encoded in unicode by python"
-->
"will be encoded *by* my editor BEFORE getting *decoded* *into* unicode by python"

 So please pay attention to this example:
 
 My editor is working in UTF8. I create this:
 
 c = "giorgio è giorgio" // This will be an UTF8 string because of the file's
 encoding
Right.
 d = unicode(c) // This will be an unicode string
 e = c.encode() // How will be encoded this string? If PY is working like PHP
 this will be an utf8 string.

Have you tried it?
>>> c = "giorgio è giorgio"
>>> d = unicode(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal 
not in range(128)

Now, tell us why! (the answer is below *)

 Can you help me?
 
 Thankyou VERY much
 
 Giorgio


Denis

(*)
You don't tell which format the source string is encoded in. By default, python 
uses ascii (I know, it's stupid) whose max code is 127. So, 'è' is not 
accepted. Now, if I give a format, all works fine:
>>> d = unicode(c,"utf8")
>>> d
u'giorgio \xe8 giorgio'

Note: unicode(c,format) is an alias for c.decode(format):
>>> c.decode("utf8")
u'giorgio \xe8 giorgio'
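
The same experiment as a Python 3 sketch (an assumption: `bytes` stands in for the 8-bit string c, and `str(c, "utf-8")` for `unicode(c, "utf8")`). Python 3 has no implicit ascii step, so the failing case is reproduced here by asking for ascii explicitly.

```python
c = "giorgio \u00e8 giorgio".encode("utf-8")   # the bytes a UTF-8 editor saves

# Decoding as ascii fails at the first byte above 127.
try:
    c.decode("ascii")
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# Naming the real format works, with either spelling.
d = c.decode("utf-8")
same = str(c, "utf-8") == d

print(failure.start, hex(c[failure.start]))
print(repr(d), same)
```

The reported position matches the traceback above: the error is raised at the first byte of the accented letter.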


la vita e estrany

spir.wikidot.com



Re: [Tutor] Encoding

2010-03-04 Thread Giorgio
Thankyou.

You have clarified many things in those emails. Due to the high number of
messages i won't quote everything.

So, as i can clearly understand reading last spir's post, python gets
strings encoded by my editor and to convert them to unicode i need to
specify HOW they're encoded. This makes clear this example:

c = "giorgio è giorgio"
d = c.decode("utf8")

I create an utf8 string, and to convert it into unicode i need to tell
python that the string IS utf8.

Just don't understand why in my Windows XP computer in Python IDLE doesn't
work:

  RESTART


>>> c = "giorgio è giorgio"
>>> c
'giorgio \xe8 giorgio'
>>> d = c.decode("utf8")

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    d = c.decode("utf8")
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10:
invalid data


In IDLE options i've set encoding to UTF8 of course. I also have some linux
servers where i can try the IDLE but Putty doesn't seem to support UTF8.

But, let's continue:

In that example i've specified UTF8 in the decode method. If i hadn't set it
python would have taken the one i specified in the second line of the file,
right?

As last point, i can't understand why this works:

>>> a = u"giorgio è giorgio"
>>> a
u'giorgio \xe8 giorgio'

And this one doesn't:

>>> a = "giorgio è giorgio"
>>> b = unicode(a)

Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    b = unicode(a)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 8:
ordinal not in range(128)


The second doesn't work because i have not told python how the string was
encoded. But in the first too i haven't specified the encoding O_O.

Thankyou again for your help.

Giorgio


Re: [Tutor] Encoding

2010-03-04 Thread spir
On Thu, 4 Mar 2010 15:13:44 +0100
Giorgio anothernetfel...@gmail.com wrote:

 Thankyou.
 
 You have clarificated many things in those emails. Due to high numbers of
 messages i won't quote everything.
 
 So, as i can clearly understand reading last spir's post, python gets
 strings encoded by my editor and to convert them to unicode i need to
 specify HOW they're encoded. This makes clear this example:
 
 c = "giorgio è giorgio"
 d = c.decode("utf8")
 
 I create an utf8 string, and to convert it into unicode i need to tell
 python that the string IS utf8.
 
 Just don't understand why in my Windows XP computer in Python IDLE doesn't
 work:
 
   RESTART
 
 
 >>> c = "giorgio è giorgio"
 >>> c
 'giorgio \xe8 giorgio'
 >>> d = c.decode("utf8")
 
 Traceback (most recent call last):
   File "<pyshell#10>", line 1, in <module>
     d = c.decode("utf8")
   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10:
 invalid data
 

How do you know your win XP terminal is configured to deal with text using 
utf8? Why do you think it should? Don't know much about windows, but I've read 
they have their own character sets (and format?). So, probably, if you haven't 
customized it, it won't. (Conversely, I guess Macs use utf8 as default. 
Can someone confirm?)
In other words, c is not a piece of text in utf8.
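
A sketch of that diagnosis, written for Python 3. The byte string is the one shown in the session above; the choice of cp1252 as the Windows code page is my assumption (the thread does not say which one the machine used), with latin-1 shown alongside since both map byte e8 to the same letter.

```python
c = b"giorgio \xe8 giorgio"   # the bytes IDLE reported on the Windows machine

# Not valid UTF-8: e8 announces a multi-byte sequence that never follows.
try:
    c.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with a single-byte Western European codec recovers the text.
as_cp1252 = c.decode("cp1252")
as_latin1 = c.decode("latin-1")

print(utf8_ok)
print(repr(as_cp1252), as_cp1252 == as_latin1)
```

A single e8 byte in the repr is the clue: UTF-8 would have stored the letter as the pair c3 a8.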

 In IDLE options i've set encoding to UTF8 of course. I also have some linux
 servers where i can try the IDLE but Putty doesn't seem to support UTF8.
 
 But, let's continue:
 
 In that example i've specified UTF8 in the decode method. If i hadn't set it
 python would have taken the one i specified in the second line of the file,
 right?
 
 As last point, i can't understand why this works:
 
 >>> a = u"giorgio è giorgio"
 >>> a
 u'giorgio \xe8 giorgio'

This trial uses the default format of your system. It does the same as
   a = "giorgio è giorgio".decode(default_format)
It's a shortcut for ustring *literals* (constants), directly expressed by the 
programmer. In source code, it would use the format specified on top of the 
file.

 And this one doesn't:
 
 >>> a = "giorgio è giorgio"
 >>> b = unicode(a)
 
 Traceback (most recent call last):
   File "<pyshell#14>", line 1, in <module>
     b = unicode(a)
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 8:
 ordinal not in range(128)

This trial uses ascii because you give no format (yes, it can be seen as a 
flaw). It does the same as
   b = "giorgio è giorgio".decode("ascii")

 
 
 The second doesn't work because i have not told python how the string was
 encoded. But in the first too i haven't specified the encoding O_O.
 
 Thankyou again for your help.
 
 Giorgio


Denis
-- 


la vita e estrany

spir.wikidot.com



Re: [Tutor] Encoding

2010-03-04 Thread Giorgio
2010/3/4 spir denis.s...@gmail.com



 How do you know your win XP terminal is configured to deal with text using
 utf8? Why do you think it should?


I think there is an option in IDLE configuration to set this. So, if my
entire system is not utf8 i can't use the IDLE for this test?



 This trial uses the default format of your system. It does the same as
   a = "giorgio è giorgio".decode(default_format)
 It's a shortcut for ustring *literals* (constants), directly expressed by
 the programmer. In source code, it would use the format specified on top of
 the file.




 This trial uses ascii because you give no format (yes, it can be seen as a
 flaw). It does the same as
   b = "giorgio è giorgio".decode("ascii")


Ok,so you confirm that:

s = u"ciao è ciao" will use the file specified encoding, and that

t = "ciao è ciao"
t = unicode(t)

Will use, if not specified in the function, ASCII. It will ignore the
encoding I specified on the top of the file. right?

Again, thank you. I'm loving python and its community.

Giorgio




-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-04 Thread Dave Angel



Giorgio wrote:

2010/3/4 spir denis.s...@gmail.com

snip
Ok,so you confirm that:

s = u"ciao è ciao" will use the file specified encoding, and that

t = "ciao è ciao"
t = unicode(t)

Will use, if not specified in the function, ASCII. It will ignore the
encoding I specified on the top of the file. right?

  
A literal "u" string, and only such a (unicode) literal string, is 
affected by the encoding specification.  Once some bytes have been 
stored in an 8 bit string, the system does *not* keep track of where they 
came from, and any conversions then (even if they're on an adjacent 
line) will use the default decoder.  This is a logical example of what 
somebody said earlier on the thread -- decode any data to unicode as 
early as possible, and deal only with unicode strings in the program.  
Then, if necessary, encode them into whatever output form immediately 
before (or while) outputting them.




Again, thank you. I'm loving python and its community.

Giorgio




  



Re: [Tutor] Encoding

2010-03-03 Thread Stefan Behnel

Giorgio, 03.03.2010 09:36:

i am looking for more informations about encoding in python:

i've read that Amazon SimpleDB accepts every string encoded in UTF-8. How
can I encode a string?


  byte_string = unicode_string.encode('utf-8')

If you use unicode strings throughout your application, you will be happy 
with the above. Note that this is advice, not a requirement.




And, what's the default string encoding in python?


default encodings are bad, don't rely on them.

Stefan



Re: [Tutor] Encoding

2010-03-03 Thread Patrick Sabin

Giorgio wrote:

i am looking for more informations about encoding in python:

i've read that Amazon SimpleDB accepts every string encoded in UTF-8. 
How can I encode a string? And, what's the default string encoding in 
python?


I think the safest way is to use unicode strings in your application and 
convert them to byte strings if needed, using the encode and decode methods.





the other question is about mysql DB: if i have a latin1 mysql field and 
extract its content in a python script, how can I handle it?


if you have a byte string s encoded in 'latin1' you can simply call:

s.decode('latin1')

to get the unicode string.
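
A sketch of the full path from a latin1 column to a UTF-8-only store, written for Python 3 (an assumption: `bytes` stands in for the byte string s). The value `db_value` is made-up sample data, not output from a real MySQL driver. The last part shows why going the other way, into latin1, can fail.

```python
# Hypothetical latin-1 bytes, as they might come back from the database.
db_value = b"citt\xe0"

text = db_value.decode("latin-1")    # bytes -> text, naming the real format
payload = text.encode("utf-8")       # text -> bytes for a UTF-8-only service
back = text.encode("latin-1")        # text -> bytes for the latin1 column

print(repr(text), repr(payload), back == db_value)

# latin-1 covers only 256 characters, so some text cannot be stored in it.
try:
    "\u20ac".encode("latin-1")       # the euro sign is not in latin-1
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(euro_fits)
```

This is one argument for switching the column to utf8 when latin1 is not actually required.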



thankyou

Giorgio


Patrick


Re: [Tutor] Encoding

2010-03-03 Thread Giorgio


  byte_string = unicode_string.encode('utf-8')

 If you use unicode strings throughout your application, you will be happy
 with the above. Note that this is an advice, not a condition.



Mmm ok. So all strings in the app are unicode by default?

Do you know if there is a function/method i can use to check encoding of a
string?



 default encodings are bad, don't rely on them.


No, ok, it was just to understand what i'm working with.

Patrick, ok. I should check if it's possible to save unicode strings in the
DB.

Do you think i'd better set my db to utf8? I don't need latin1, it's just
the default value.

Thankyou

Giorgio


-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
Oh, sorry, let me update my last post:

if i have a string, let's say:

s = "hi giorgio"

and want to store it in a latin1 db, i need to convert it to latin1 before
storing, right?

-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Patrick Sabin



Mmm ok. So all strings in the app are unicode by default?

Depends on your python version. If you use python 2.x, you have to use a 
u before the string:


s = u'Hallo World'

Do you know if there is a function/method i can use to check encoding of 
a string?


AFAIK such a function doesn't exist. Python3 solves this by using 
unicode strings by default.
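
Since the encoding cannot be read off a byte string, the usual workaround is to try a short list of candidates. The sketch below is written for Python 3; the function name and the candidate list are assumptions for illustration, and the result is a guess rather than a detection:

```python
def decode_with_candidates(data, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    This is guesswork, not detection: latin-1 accepts any byte sequence,
    so it only makes sense as the last fallback.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit the data")


print(decode_with_candidates(b"ciao \xc3\xa8 ciao"))  # valid UTF-8
print(decode_with_candidates(b"ciao \xe8 ciao"))      # not valid UTF-8
```

Knowing the encoding from the source of the data (database settings, HTTP headers, file conventions) is always more reliable than guessing.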


Patrick, ok. I should check if it's possible to save unicode strings in 
the DB.


It is more an issue of your database adapter, than of your database.



Do you think i'd better set my db to utf8? I don't need latin1, it's 
just the default value.


I think the encoding of the db doesn't matter much in this case, but I 
would prefer utf-8 over latin-1. If you get a utf-8 encoded raw byte 
string you call .decode('utf-8'). In case of a latin-1 encoded string 
you call .decode('latin1').



Thankyou

Giorgio

- Patrick


Re: [Tutor] Encoding

2010-03-03 Thread Stefan Behnel

Giorgio, 03.03.2010 15:50:

 Depends on your python version. If you use python 2.x, you have to use a
 u before the string:

 s = u'Hallo World'


Ok. So, let's go back to my first question:

s = u'Hallo World' is unicode in python 2.x -  ok


Correct.


s = 'Hallo World' how is encoded?


Depends on your source code encoding.

http://www.python.org/dev/peps/pep-0263/



Well, the problem comes,  i.e when i'm getting a string from an HTML form
with POST. I don't and can't know the encoding, right? It depends on
browser.


The browser will tell you the encoding in the headers that it transmits.

Stefan



Re: [Tutor] Encoding

2010-03-03 Thread Patrick Sabin

Giorgio wrote:


Depends on your python version. If you use python 2.x, you have to
use a u before the string:

s = u'Hallo World'


Ok. So, let's go back to my first question: 


s = u'Hallo World' is unicode in python 2.x - ok
s = 'Hallo World' how is encoded?


I am not 100% sure, but I think it depends on the encoding of your 
source file or the coding you specify. See PEP 263

http://www.python.org/dev/peps/pep-0263/

Well, the problem comes,  i.e when i'm getting a string from an HTML 
form with POST. I don't and can't know the encoding, right? It depends 
on browser.


Right, but you can do something about it. Tell the browser which 
encoding you are going to accept:


<form ... accept-charset="UTF-8">
...
</form>

- Patrick


Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
Uff, encoding is a very painful thing in programming.

Ok, so now comes the last layer of the encoding: the webserver.

I now know how to handle encoding in a python app and in interactions with
the db, but the last step is sending the content to the webserver.

How should I encode pages? Does the encoding I choose have to be the same as
the one I chose in the .htaccess file? Or can I send content to Apache in
whatever encoding I like, and have it re-encode all pages the right way?

Thank you

-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Dave Angel

Giorgio wrote:


 Depends on your python version. If you use python 2.x, you have to use a
 u before the string:

 s = u'Hallo World'




Ok. So, let's go back to my first question:

s = u'Hallo World' is unicode in python 2.x - ok
s = 'Hallo World' how is encoded?

  

Since it's a quoted literal in your source code, it's encoded by your 
text editor when it saves the file, and you tell Python which encoding 
it was by the second line of your source file, right after the shebang line.
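
To make the declaration concrete, here is a sketch for Python 3 that builds a tiny source file in memory (its contents are invented for illustration) and asks the standard tokenize module which encoding the file declares:

```python
import io
import tokenize

# A made-up source file: shebang first, PEP 263 declaration second.
source = (
    b"#!/usr/bin/env python\n"
    b"# -*- coding: latin-1 -*-\n"
    b"s = 'Hallo World'\n"
)

# detect_encoding() looks only at the first two lines, as PEP 263 specifies,
# and returns the normalized encoding name plus the lines it consumed.
encoding, _lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)
```

The declaration only tells Python how to read the file; the editor still has to save the file in that same encoding.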


A sequence of bytes in an html file should have its encoding 
identified by the tag at the top of the html file.  And I'd  *guess* 
that on a form result, the encoding can be assumed to match that of the 
html of the form itself.


DaveA



Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
Ok.

So, how do you encode .py files? UTF-8?

-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
Oops, I have another update:

string = u"blabla"

This is unicode, ok. Unicode UTF-8?

Thank you

-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Stefan Behnel

Giorgio, 03.03.2010 18:28:

string = u"blabla"

This is unicode, ok. Unicode UTF-8?


No, not UTF-8. Unicode.

You may want to read this:

http://www.amk.ca/python/howto/unicode

Stefan



Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
Please let me post the third update O_o. You can forget the other 2, I'll put
them into this email.

---
>>> s = "ciao è ciao"
>>> print s
ciao è ciao
>>> s.encode('utf-8')

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    s.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5:
ordinal not in range(128)
---

I am getting more and more confused.

I was coding in PHP and was saving some strings in the DB. I was using
utf8_encode to encode them before sending them to the utf8_unicode_ci table. Ok,
the result was that the strings were double encoded. To fix that I simply
removed the utf8_encode() function and put the raw data in the database
(which converts it to utf8). In other words, in PHP, I can encode a string
multiple times:

$c = "giorgio è giorgio";
$c = utf8_encode($c); // this will work in an utf8 html page
$d = utf8_encode($c); // this won't work, will print a strange letter
$d = utf8_decode($d); // this will work. will print an utf8 string

Ok, now, the point is: you (and the manual) said that this line:

s = u"giorgio è giorgio"

will convert the string as unicode. But also said that the part between ""
will be encoded with my editor BEFORE getting encoded in unicode by python.
So please pay attention to this example:

My editor is working in UTF8. I create this:

c = "giorgio è giorgio" // This will be a UTF8 string because of the file's
encoding
d = unicode(c) // This will be a unicode string
e = c.encode() // How will this string be encoded? If PY is working like PHP
this will be a utf8 string.

Can you help me?

Thankyou VERY much

Giorgio


Re: [Tutor] Encoding

2010-03-03 Thread Giorgio
I'm sorry, it's utf8_unicode_ci that's confusing me.

So, UTF-8 is one of the most commonly used encodings. UTF stands for
"Unicode Transformation Format". UTF8 is, we can say, a type of unicode,
right? And what about utf8_unicode_ci in mysql?

Giorgio

-- 
--
AnotherNetFellow
Email: anothernetfel...@gmail.com


Re: [Tutor] Encoding

2010-03-03 Thread Dave Angel
(Don't top-post.  Put your response below whatever you're responding to, 
or at the bottom.)


Giorgio wrote:

Ok.

So, how do you encode .py files? UTF-8?

2010/3/3 Dave Angel da...@ieee.org

  
I personally use Komodo to edit my python source files, and tell it to 
use UTF8 encoding.  Then I add an encoding line as the second line of the 
file.  Many times I get lazy, because mostly my source doesn't contain 
non-ASCII characters.  But if I'm copying characters from an email or 
other Unicode source, then I make sure both are set up.  The editor will 
actually warn me if I try to save a file as ASCII with any 8 bit 
characters in it.


Note:  unicode is 16 bit characters, at least in the usual CPython 2 
build (some builds use 32 bit characters).  UTF-8 is an 8 bit encoding of 
that Unicode, where there's a direct algorithm to turn 16 or even 32 bit 
Unicode into 8 bit characters.  They are not the same, although some 
people use the terms interchangeably.


Also note:  An 8 bit string  has no inherent meaning, until you decide 
how to decode it into Unicode.  Doing explicit decodes is much safer, 
rather than assuming some system defaults.  And if it happens to contain 
only 7 bit characters, it doesn't matter what encoding you specify when 
you decode it.  Which is why all of us have been so casual about this.
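
The point about 7 bit characters can be checked directly. This is a sketch for Python 3 with sample byte strings chosen for illustration; cp437 is just one arbitrary 8 bit encoding picked to show a disagreement:

```python
ascii_only = b"Hallo World"

# Pure 7 bit data decodes to the same text under all of these encodings.
decoded = {enc: ascii_only.decode(enc)
           for enc in ("ascii", "latin-1", "utf-8", "cp1252")}
all_same = len(set(decoded.values())) == 1
print(all_same)

# One byte above 127 is enough to make the choice of encoding matter.
accented = b"ciao \xe8 ciao"
as_latin1 = accented.decode("latin-1")
as_cp437 = accented.decode("cp437")
print(as_latin1 == as_cp437)
```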





Re: [Tutor] Encoding

2010-03-03 Thread Sander Sweers
On 3 March 2010 20:44, Giorgio anothernetfel...@gmail.com wrote:
 >>> s = "ciao è ciao"
 >>> print s
 ciao è ciao
 >>> s.encode('utf-8')
 Traceback (most recent call last):
   File "<pyshell#2>", line 1, in <module>
     s.encode('utf-8')
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5:
 ordinal not in range(128)

It is confusing, but once you understand how it works it makes sense.

You start with an 8-bit string, so you will want to *decode* it to a unicode string.

>>> s = "ciao è ciao"
>>> us = s.decode('latin-1')
>>> us
u'ciao \xe8 ciao'
>>> us2 = s.decode('iso-8859-1')
>>> us2
u'ciao \xe8 ciao'
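
Both decodes give the same result because the two names refer to the same codec. A short sketch for Python 3 that checks this through the standard codecs registry (the list of spellings is only an illustration):

```python
import codecs

# Several spellings that Python accepts for the same codec.
names = ["latin-1", "latin1", "iso-8859-1", "latin_1"]
canonical = {name: codecs.lookup(name).name for name in names}
same_codec = len(set(canonical.values())) == 1
print(canonical)

data = b"ciao \xe8 ciao"
print(data.decode("latin-1") == data.decode("iso-8859-1"))
```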

Greets
Sander


Re: [Tutor] Encoding

2010-03-03 Thread Sander Sweers
On 3 March 2010 22:41, Sander Sweers sander.swe...@gmail.com wrote:
 It is confusing but once understand how it works it makes sense.

I remembered Kent explained it very clearly in [1].

Greets
Sander

[1] http://mail.python.org/pipermail/tutor/2009-May/068920.html


Re: [Tutor] Encoding

2010-03-03 Thread spir
On Wed, 3 Mar 2010 16:32:01 +0100
Giorgio anothernetfel...@gmail.com wrote:

 Uff, encoding is a very painful thing in programming.

For sure, but it's true for any kind of data, not only text :-) Think of music 
or image *formats*. The issue is a bit obscured for text by the use of the 
mysterious, _cryptic_ (!), word "encoding".

When editing an image using a software tool, there is a live representation of 
the image in memory (say, a plain 2D pixel array), which is probably what the 
developer found most practical for image processing. [text processing in 
python: unicode string type] When the job is finished, you can choose between 
various formats (png, gif, jpeg..) to save and/or transfer it. [text: 
utf-8/16/32, iso-8859-*, ascii...]. Conversely, if you want to edit an existing 
image, the software needs to convert back from the file format into its 
internal representation; the format needs to be indicated in the file, or by the 
user, or guessed.

The only difference with text is that there is no builtin image or sound 
representation _type_ in python -- only because image and sound are domain 
specific data while text is needed everywhere.

Denis
-- 


la vita e estrany

spir.wikidot.com



Re: [Tutor] Encoding

2010-03-03 Thread spir
On Wed, 3 Mar 2010 20:44:51 +0100
Giorgio anothernetfel...@gmail.com wrote:

 Please let me post the third update O_o. You can forgot other 2, i'll put
 them into this email.
 
 ---
 >>> s = "ciao è ciao"
 >>> print s
 ciao è ciao
 >>> s.encode('utf-8')
 
 Traceback (most recent call last):
   File "<pyshell#2>", line 1, in <module>
     s.encode('utf-8')
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5:
 ordinal not in range(128)
 ---
 
 I am getting more and more confused.

What you enter at the terminal prompt is text, encoded in a format (ascii, 
latin*, utf*,...) that probably depends on your system locale. As this format 
is always a sequence of bytes, python stores it as a plain str:
>>> s = "ciao è ciao"
>>> s,type(s)
('ciao \xc3\xa8 ciao', <type 'str'>)
My system is configured for utf8. c3-a8 is the repr of 'è' in utf8. It needs 2 
bytes because of the rules of utf8 itself. Right?

To get a python unicode string, it must be decoded from its format, for me utf8:
>>> u = s.decode("utf8")
>>> u,type(u)
(u'ciao \xe8 ciao', <type 'unicode'>)
e8 is the unicode code for 'è' (decimal 232). You can check that in tables. It 
needs only one byte here because 232 < 256.
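
The difference between the code point and its encoded bytes can be seen side by side. A sketch for Python 3, using the same character as above:

```python
ch = "è"

code_point = ord(ch)                  # position in the Unicode tables
utf8_bytes = ch.encode("utf-8")       # two bytes under UTF-8's rules
latin1_bytes = ch.encode("latin-1")   # one byte, equal to the code point

print(code_point, hex(code_point))
print(utf8_bytes, len(utf8_bytes))
print(latin1_bytes, len(latin1_bytes))
```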

[comparison with php]

 Ok, now, the point is: you (and the manual) said that this line:
 
 s = u"giorgio è giorgio"
 
 will convert the string as unicode.

Yes and no: it will convert it *into* a unicode string, in the sense of a 
python representation for universal text. When seeing u"..." , python will 
automagically *decode* the part in "...", taking as source format the one you 
indicate in a pseudo-comment on top of your code file, eg:
# coding: utf8
Else I guess the default is the system's locale format? Or ascii? Does anyone know?
So, in my case u"giorgio è giorgio" is equivalent to "giorgio è 
giorgio".decode("utf8"):
>>> u1 = u"giorgio è giorgio"
>>> u2 = "giorgio è giorgio".decode("utf8")
>>> u1,u2
(u'giorgio \xe8 giorgio', u'giorgio \xe8 giorgio')
>>> u1 == u2
True

 But also said that the part between ""
 will be encoded with my editor BEFORE getting encoded in unicode by python.

"will be encoded with my editor BEFORE getting encoded in unicode by python"
-->
"will be encoded *by* my editor BEFORE getting *decoded* *into* unicode by python"

 So please pay attention to this example:
 
 My editor is working in UTF8. I create this:
 
 c = "giorgio è giorgio" // This will be an UTF8 string because of the file's
 encoding
Right.
 d = unicode(c) // This will be an unicode string
 e = c.encode() // How will be encoded this string? If PY is working like PHP
 this will be an utf8 string.

Have you tried it?
>>> c = "giorgio è giorgio"
>>> d = unicode(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal 
not in range(128)

Now, tell us why! (the answer is below *)

 Can you help me?
 
 Thankyou VERY much
 
 Giorgio


Denis

(*)
You don't tell python which format the source string is encoded in. By default, 
python uses ascii (I know, it's stupid), whose max code is 127. So, 'è' is not 
accepted. Now, if I give a format, all works fine:
>>> d = unicode(c,"utf8")
>>> d
u'giorgio \xe8 giorgio'

Note: unicode(c,format) is an alias for c.decode(format):
>>> c.decode("utf8")
u'giorgio \xe8 giorgio'
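
For readers on Python 3, roughly the same experiment can be sketched as follows. There, str(c, encoding) plays the role that unicode(c, encoding) plays above, and there is no implicit ascii step, so the ascii decode is requested explicitly to show the failure:

```python
# The bytes a UTF-8 editor would save for this literal.
c = "giorgio è giorgio".encode("utf-8")

try:
    c.decode("ascii")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start   # index of the first byte ascii cannot handle

d1 = str(c, "utf-8")        # same result as the method call below
d2 = c.decode("utf-8")

print(failed_at)
print(d1 == d2, d1)
```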


la vita e estrany

spir.wikidot.com



Re: [Tutor] Encoding question

2009-09-09 Thread Kent Johnson
On Wed, Sep 9, 2009 at 5:06 AM, Oleg Oltar oltarase...@gmail.com wrote:
 Hi!

 One of my tests returned following text ()

 The test:
 from django.test.client import Client
  c = Client()
 resp = c.get("/")
 resp.content

 In [25]: resp.content
 Out[25]: '\r\n\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0
 Strict//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n\r\n<html
 xmlns="http://www.w3.org/1999/xhtml">\r\n  <head>\r\n    <meta
 http-equiv="content-type" content="text/html; charset=utf-8" />\r\n
 \r\n    \n<title>Japanese innovation |
 \xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f
 \xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8</title>\n\r\n
<snip>
 Is there a way I can convert it to normal readable text? (I need for example
 to find a string of text in this response to check if my test case Pass or
 failed)

resp.content.decode('string_escape') will convert it to encoded bytes.
Then another decode() with the correct encoding will get you Unicode.
I'm not sure what the correct encoding is for the second decode(),
most likely one of 'utf-8', 'utf_16_le' or 'utf_16_be'.
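
As a check on which encoding fits, the title bytes from the quoted response can be decoded directly. This is a sketch for Python 3, assuming the response body is a bytes object and that the page really is UTF-8, as its own charset declaration says:

```python
# Bytes copied from the title in the quoted response.
raw = (
    b"Japanese innovation | "
    b"\xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f "
    b"\xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8"
)

text = raw.decode("utf-8")
print(text)

# Once decoded, an ordinary substring test works for the test case.
found = "Япония" in text
print(found)
```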

Kent


Re: [Tutor] Encoding and Decoding

2007-01-02 Thread Kent Johnson
Carlos wrote:

 The genetic algorithm that I'm using (GA) generates solutions for a given 
 problem, expressed as a list; this list is composed of integers. Every 
 element in the list takes 8 integers; it's a little messy, but this is because
 
 List [0] = Tens X position
 List [1] = Units X position
 List [2] = Decimals X position
 List [3] = If less than 5 the number is negative, else it is positive
 
 Then if the result is List = [6, 1, 2, 3] the X position equals -612.3. 
 This is the same for the Y position. If there are 10 elements the list 
 is going to be 80 integers long and if there are 100 elements, well you 
 get a very long list...
 
 With this in mind my question would be, how can I keep track of this 
 information? I mean how can I assign this List positions to each 
 element? This is needed because this is going to be a long list and the 
 GA needs to evaluate the position of each element with respect to the 
 position of the other elements. So it needs to know that certain numbers 
 are related to certain element and it needs to have access to the size, 
 level, name and parent information... I hope that this is clear enough.

I will assume there is a good reason for storing the coordinates in this 
form...

Do the numbers have to be all in a single list? I would start by 
breaking it up into lists of four, so if you have 10 elements you would 
have a list of 20 small lists. It might make sense to pair the x and y 
lists so you have a list of 10 lists of 2 lists of 4 numbers, e.g.
[ [ [6, 1, 2, 3], [7, 2, 8, 4] ], ...]

Another thing to consider is whether you might want to make a class to 
hold the coordinate values, then you could refer to x.tens, x.units, 
x.decimal, x.sign by name.

If you need a single list for the GA to work, one alternative would be 
to make converters between the nested representation and the flat one. 
Alternately you could wrap the list in a class which provides helpful 
accessors.
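
A rough sketch of the converter idea, in Python. The names Coord, nest and flatten are made up for illustration, and the layout (four integers for x followed by four for y per element) is taken from the description above:

```python
from collections import namedtuple

# Field names follow the x.tens, x.units, x.decimal, x.sign suggestion.
Coord = namedtuple("Coord", "tens units decimal sign")


def nest(flat):
    """Turn a flat list of integers into [(x Coord, y Coord), ...]."""
    if len(flat) % 8:
        raise ValueError("expected 8 integers per element")
    coords = [Coord(*flat[i:i + 4]) for i in range(0, len(flat), 4)]
    return [(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)]


def flatten(nested):
    """Inverse of nest(): back to the single flat list the GA works on."""
    return [digit for x, y in nested for coord in (x, y) for digit in coord]


flat = [6, 1, 2, 3, 7, 2, 8, 4]
nested = nest(flat)
print(nested[0][0].tens, nested[0][1].sign)
print(flatten(nested) == flat)
```

The GA can keep mutating the flat list, while the rest of the program reads positions by name from the nested form.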

HTH
Kent



Re: [Tutor] Encoding and XML troubles

2006-11-05 Thread Kent Johnson
William O'Higgins Witteman wrote:
 I've been struggling with encodings in my XML input to Python programs.
 
 Here's the situation - my program has no declared encoding, so it
 defaults to ASCII.  It's written in Unicode, but apparently that isn't
 confusing to the parser.  Fine by me.  I import some XML, probably
 encoded in the Windows character set (I don't remember what that's
 called now).  I can read it for the most part - but it throws exceptions
 when it hits accented characters (some data is being input by French
 speakers).  I am using ElementTree for my XML parsing
 
 What I'm trying to do is figure out what I need to do to get my program
 to not barf when it hits an accented character.  I've tried adding an
 encoding line as suggested here:
 
 http://www.python.org/dev/peps/pep-0263/
 
 What these do is make the program fail to parse the XML at all.  Has
 anyone encountered this?  Suggestions?  Thanks.

As Luke says, the encoding of your program has nothing to do with the 
encoding of the XML or the types of data your program will accept. PEP 
263 only affects the encoding of string literals in your program.

It sounds like your XML is not well-formed. XML files can have an 
encoding declaration *in the XML*. If it is not present, the file is 
assumed to be in UTF-8 encoding. If your XML is in Cp1252 but lacks a 
correct encoding declaration, it is not valid XML because the Cp1252 
characters are not valid UTF-8.

Try including the line
<?xml version="1.0" encoding="windows-1252"?>
or
<?xml version="1.0" encoding="Cp1252"?>

as the first line of the XML. (windows-1252 is the official 
IANA-registered name for Cp1252; I'm not sure which name will actually 
work correctly.)
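
A small sketch of the difference the declaration makes, using the ElementTree module from the standard library on Python 3. The document content and the element name are invented for illustration, and only the windows-1252 spelling is tried here:

```python
import xml.etree.ElementTree as ET

# 0xe9 is "é" in windows-1252, but it is not valid UTF-8 on its own.
declared = b'<?xml version="1.0" encoding="windows-1252"?><nom>caf\xe9</nom>'
undeclared = b'<nom>caf\xe9</nom>'

text = ET.fromstring(declared).text
print(text)

try:
    ET.fromstring(undeclared)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False
print(parsed_without_declaration)
```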

Kent



Re: [Tutor] Encoding and XML troubles

2006-11-05 Thread Dustin J. Mitchell
For what it's worth, the vast majority of the XML out there (especially if
you're parsing RSS feeds, etc.) is written by monkeys and is totally
ill-formed.  It seems the days of 'it looked OK in my browser' are still here.

To find out if it's your app or the XML, you could try running the XML through
a validating parser.  There are also various tools out there which might be
able to parse the XML anyway -- xmllint, I believe, can do this.

Dustin (not by *any* stretch an expert on XML *or* Unicode)


Re: [Tutor] Encoding and XML troubles

2006-11-04 Thread Luke Paireepinart
Inputting XML into a Python program has nothing to do with what encoding the
Python source is in. So it seems to me that that particular PEP doesn't apply
in this case at all. I'm guessing that the ElementTree module has an option to
use Unicode input.



Re: [Tutor] encoding text in html

2006-09-13 Thread anil maran
submits: We\xe2\x80\x99re pretty sur
this is how it is stored in postgres
please help me out
thanks

- Original Message 
From: anil maran [EMAIL PROTECTED]
To: tutor@python.org
Sent: Wednesday, September 13, 2006 12:14:10 AM
Subject: encoding text in html

i was trying to display some text
it is in utf-8 in postgres and when it is displayed in firefox and ie,
it gets displayed as some symbols with 4 numbers in a box or so
even for ' apostrophe
please tell me how to display this properly
i try 
title.__str__

or title.__repr__
both don't work


Re: [Tutor] encoding text in html

2006-09-13 Thread anil maran

		  
	


「ひぐらしのなく頃に」30秒TVCF風ver.0.1
this is how it is getting displayed in browser


Re: [Tutor] encoding text in html

2006-09-13 Thread Kent Johnson
anil maran wrote:
 
 「ひぐらしのなく頃に」30秒TVCF風ver.0.1 http://youtube.com/?v=0WmeTRcAiec
 
 this is how it is getting displayed in browser

I'm pretty sure that is not how
We\xe2\x80\x99re
displays; can you show an example of the same text as it is stored and 
as it displays?

Kent

 


Re: [Tutor] encoding text in html

2006-09-13 Thread Kent Johnson
anil maran wrote:
 
 
 i was trying to display some text
 it is in utf-8 in postgres and when it is displayed in firefox and ie, 
 it gets displayed as some symols with 4numbers in a box or so
 even for ' apostrophe
 please tell me how to display this properly
 i try
 title.__str__
 
 or title.__repr__
 both dont work

Do you have the page encoding set to utf-8 in Firefox? You can do this
with View / Character Encoding as a test. If it displays correctly when
you set the encoding then you should include a meta tag in the HTML that
sets the charset. Put this in the head of the HTML:
   <meta http-equiv="content-type" content="text/html; charset=utf-8" />

Kent




Re: [Tutor] encoding text in html

2006-09-13 Thread Danny Yoo


On Wed, 13 Sep 2006, anil maran wrote:


 i was trying to display some text it is in utf-8 in postgres and when it 
 is displayed in firefox and ie, it gets displayed as some symols with 
 4numbers in a box or so even for ' apostrophe please tell me how to 
 display this properly i try title.__str__

I'm assuming that you're dynamically generating some HTML document.  If 
so, have you declared the document encoding in the HTML file to be 
utf-8?

See:

 http://www.joelonsoftware.com/articles/Unicode.html

Do you have a small sample of the HTML file that's being generated?  One 
of us here may want to inspect it to make sure you really are generating 
UTF-8 output.  You may also want to show the Python code you've written to 
generate the output.

Try to give us enough information so we can attempt to reproduce what 
you're seeing.

Good luck!


Re: [Tutor] encoding

2006-09-11 Thread Kent Johnson
Jose P wrote:
 watch this example:
 
 >>> a=['lula', 'cação']
 >>> print a
 ['lula', 'ca\xc3\xa7\xc3\xa3o']
 >>> print a[1]
 cação
 
 
 when i print the list the special characters are not printed correctly! 

When you print a list, it uses repr() to format the contents of the 
list; when you print an item directly, str() is used. For a string 
containing non-ascii characters, the results are different.
 
 But if i print only the list item that has the special charaters it runs
 OK.
 
 How do i get list print correctly?

You will have to do the formatting yourself. A simple solution might be
for x in a:
   print x

If you want exactly the list formatting you have to work harder. Try 
something like
'[' + ', '.join([str(x) for x in a]) + ']'
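
The same manual formatting as a runnable sketch for Python 3. Note that on Python 3 the repr of a string keeps printable non-ASCII characters, so printing the list directly is already readable there; the join is mainly useful when the quotes are not wanted:

```python
a = ['lula', 'cação']

# Build the list-like text by hand, using str() for each item.
formatted = '[' + ', '.join(str(x) for x in a) + ']'
print(formatted)

# For comparison: on Python 3 this shows the accented characters too.
print(a)
```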

Kent



Re: [Tutor] encoding

2006-04-12 Thread Hugo González Monteverde
kakada wrote:
  LookupError: unknown encoding: ANSI
 
  so what is the correct way to do it?
 

stringinput.encode('latin_1')

works for me.

Do a Google search for Python encodings, and you will find what the 
right names for the encodings are.

http://docs.python.org/lib/standard-encodings.html
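
Whether a given name is one Python knows can also be checked from code. A sketch using the standard codecs registry; the helper name is_known_encoding and the bogus name in the example are made up for illustration:

```python
import codecs


def is_known_encoding(name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print(is_known_encoding("latin_1"))
print(is_known_encoding("no-such-encoding"))
```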

Hugo
