subject:"Re\: \[Tutor\] Encoding"


On 01/05/2014 12:52 AM, Steven D'Aprano wrote:

If you don't understand an exception, you
have no business covering it up and hiding that it took place. Never use
a bare try...except, always catch the *smallest* number of specific
exception types that make sense. Better is to avoid catching exceptions
at all: an exception (usually) means something has gone wrong. You
should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare. -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash,
it just covers up the fact that an exception occurred.


An exception, or any other kind of anomaly detected by a function one calls, is
in most cases a *symptom* of an error somewhere else in one's code (possibly far
away in the source, possibly much earlier, possibly apparently unrelated).
Catching an exception (except in rare cases) just suppresses a _signal_ about a
probable error. Catching an exception does not make the code correct; it just
pretends to (except in rare cases). It's like hiding the dirt under a carpet, or
beating up the poor guy who ran 3 kilometers to tell you a fire is threatening
your home.


Again: the anomaly (eg wrong input) detected by a function is not the error; it
is a consequence of the true original error, which is what one should aim to
correct. (But our culture apparently loves repressing symptoms rather than
curing actual problems: we programmers just often thoughtlessly apply the
scheme ;-)


We should instead gratefully thank function authors for having correctly done
their job of checking input. They offer us the information needed to find bugs
which might otherwise happily live on undetected, and thus the opportunity to
write more correct software. (This is why function authors should check input,
refuse any anomalous or dubious values, and never ever try to guess what the app
expects in such cases; instead just say "cannot do my job safely, or at all".)


If one is passing an empty set to an 'average' function, don't blame the
function or shut up the signal/exception; instead be grateful to the function's
author, and find out why and how the set comes to be empty. If one is trying to
write into a file, don't blame the file for not existing, the user for being
stupid, or shut up the signal/exception; instead be grateful to the function's
author, and find out why and how the file comes not to exist at this point
(about the user: is your doc clear enough?).


The sub-category of cases where exception handling makes sense at all is the
following:
* a called function may fail (eg average, find a given item in a list, write
into a file)
* and, the failure case makes sense for the app, it _does_ belong to the app
logic
* and, the case should nevertheless be handled like the others up to this point
in the code (meaning, there should not be a separate branch for it, we should
really land there in the code even for this failure case)
* and, one cannot know whether it is a failure case without trying, or checking
would be as costly as just trying (false for average, true for the other 2
examples)
* and, one can repair the failure right here, in any case, and go on correctly
according to the app logic (depends on the app) (there is also the category of
alternate running modes)


In such a situation, the right thing to do is to catch the exception signal (or
use whatever error management exists, eg a check for a None return value) and
proceed correctly (and remember to test this case ;-).


But this is not that common. In particular, if the failure case does not belong
to the app logic (the item should be there, the file should exist) then do *not*
catch a potential signal: if it happens, it tells you about a bug *elsewhere*
in the code; and _this_ is what needs correcting.
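To make the criteria above concrete, here is a minimal Python 3 sketch. The
names average and find_index are made up for illustration; they are not taken
from anyone's code in this thread.

```python
def average(values):
    """Return the arithmetic mean; refuse empty input instead of guessing."""
    if not values:
        raise ValueError("average() of an empty sequence is undefined")
    return sum(values) / len(values)


def find_index(items, wanted):
    """Return the index of wanted, or None when it is absent.

    Absence is a normal outcome here (it belongs to the app logic), and we
    cannot know without searching, so catching ValueError is justified.
    """
    try:
        return items.index(wanted)
    except ValueError:
        return None


cities = ["Bogotá", "Santa Rosa"]
print(find_index(cities, "Lima"))   # absent city: handled, prints None

# No try/except around average(): an empty list here would mean a bug
# somewhere earlier, and the ValueError is the signal we want to see.
print(average([2, 4]))
```

Only find_index catches anything, and only the one exception type that
list.index() raises for a missing item.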


There is a mythology in programming that software should not crash; wrongly
understood (or rightly, authors of such texts usually are pretty unclear and
ambiguous), this leads to catching exceptions that are just signals of symptoms
of errors... Instead, software should crash whenever it is incorrect; often
(when the error does not cause obvious misbehaviour) it is the only way for the
programmer to know about errors. Crashes are the programmer's best friend (I
mean, those programmers whose aim is to write quality software).


Denis
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] encoding question


On 01/04/2014 08:26 PM, Alex Kleider wrote:

Any suggestions as to a better way to handle the problem of encoding in the
following context would be appreciated.  The problem arose because 'Bogota' is
spelt with an acute accent on the 'a'.

$ cat IP_info.py3
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# file: 'IP_info.py3'  a module.

import urllib.request

url_format_str = \
    'http://api.hostip.info/get_html.php?ip=%s&position=true'

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib.request.urlopen(url_format_str %\
                                   (ip_address, )).read()
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        if item.startswith(b"Country:"):
            try:
                country = item[9:].decode('utf-8')
            except:
                print("Exception raised.")
                country = item[9:]
        elif item.startswith(b"City:"):
            try:
                city = item[6:].decode('utf-8')
            except:
                print("Exception raised.")
                city = item[6:]
        elif item.startswith(b"Latitude:"):
            try:
                lat = item[10:].decode('utf-8')
            except:
                print("Exception raised.")
                lat = item[10:]
        elif item.startswith(b"Longitude:"):
            try:
                lon = item[11:].decode('utf-8')
            except:
                print("Exception raised.")
                lon = item[11:]
        elif item.startswith(b"IP:"):
            try:
                ip = item[4:].decode('utf-8')
            except:
                print("Exception raised.")
                ip = item[4:]
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip}

if __name__ == "__main__":
    addr =  "201.234.178.62"
    print ("""    IP address is %(IP)s:
        Country: %(Country)s;  City: %(City)s.
        Lat/Long: %(Lat)s/%(Long)s""" % ip_info(addr))


The output I get on an Ubuntu 12.4LTS system is as follows:
alex@x301:~/Python/Parse$ ./IP_info.py3
Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833


I would have thought that utf-8 could handle the 'a-acute'.

Thanks,
alex


'á' does not encode to 0xe1 in UTF-8 (there it is the two bytes 0xc3 0xa1); what
you read is probably legacy data in Latin-1 (or another latin-* encoding), where
'á' is the single byte 0xe1.
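A quick way to check this for yourself in Python 3 (a small sketch; the
variable names are only for illustration):

```python
text = "Bogotá"

utf8_bytes = text.encode("utf-8")      # 'á' becomes two bytes: 0xc3 0xa1
latin1_bytes = text.encode("latin-1")  # 'á' becomes one byte: 0xe1

print(utf8_bytes)
print(latin1_bytes)

# Decoding with the encoding that produced the bytes gives the text back.
print(latin1_bytes.decode("latin-1") == text)
```

The second line printed ends in \xe1, which is exactly the byte that showed up
in the City field.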


Denis

Re: [Tutor] encoding question


On 01/05/2014 03:31 AM, Alex Kleider wrote:

I've been maintaining both a Python3 and a Python2.7 version.  The latter has
actually opened my eyes to more complexities. Specifically the need to use
unicode strings rather than Python2.7's default ascii.


So-called Unicode strings are not the solution to all problems. Take your 'á',
which can be represented either by 1 precomposed code (Unicode code point)
0xe1, or by 2 codes (one for the base 'a', one for the combining '´'). Imagine
you search for Bogotá: how do you know which representation is used in the text
you search? How do you know at all that there are multiple representations, and
what they are? The routine will work iff, by chance, your *programming editor*
(!) used the same representation as the software used to create the searched
text...


Usually that is the case, because most text-creation software uses precomposed
codes, when they exist, for composite characters. (But this fact just makes the
issue rarer, harder to be aware of, and thus more difficult to cope with
correctly in code. As far as I know nearly no software does it.)
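For what it is worth, Python's standard library can bring both representations
to a common form before comparing or searching. The sketch below uses
unicodedata.normalize, which was not mentioned in this thread; it is only meant
to illustrate the two representations described above.

```python
import unicodedata

precomposed = "Bogot\u00e1"    # 'á' as one code point, U+00E1
decomposed = "Bogota\u0301"    # 'a' followed by combining acute, U+0301

# They look the same when printed, but they are different strings.
print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 6 7

# Normalizing to NFC (composed form) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)                    # True
```

Normalizing both the search term and the searched text to the same form (NFC
or NFD) removes the dependence on what the editor happened to produce.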


Denis

Re: [Tutor] encoding question


On 01/05/2014 08:57 AM, Alex Kleider wrote:

On 2014-01-04 21:20, Danny Yoo wrote:

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.


Well, I've tried the xml approach which seems promising but still I get an
encoding related error.


Note that the (computing) data description format (JSON, XML...) and the textual
format, or encoding (Unicode utf8/16/32, legacy iso-8859-* also called
latin-*, ...) are more or less unrelated and independent. Changing the data
description format cannot solve a text encoding issue (but it may hide it, if by
chance the new data description format happens to use the text encoding you
use when reading, implicitly or explicitly).
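A small Python 3 sketch of that independence, using JSON as the data format.
The payload is invented for illustration and has nothing to do with what
hostip.info actually returns.

```python
import json

payload = json.dumps({"city": "Bogotá"}, ensure_ascii=False)
as_utf8 = payload.encode("utf-8")      # the bytes that travel over the wire

# Decoding with the right encoding, then parsing the data format:
right = json.loads(as_utf8.decode("utf-8"))["city"]

# Decoding with the wrong encoding still yields valid JSON, but wrong text:
wrong = json.loads(as_utf8.decode("latin-1"))["city"]

print(right)   # Bogotá
print(wrong)   # BogotÃ¡
```

The JSON parser is happy in both cases; only the choice of text encoding
decides whether the city name comes out right.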


Denis

Re: [Tutor] encoding question

2014-01-05 Thread Mark Lawrence


On 05/01/2014 02:31, Alex Kleider wrote:


I've been maintaining both a Python3 and a Python2.7 version.  The
latter has actually opened my eyes to more complexities. Specifically
the need to use unicode strings rather than Python2.7's default ascii.



This might help http://python-future.org/

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence


Re: [Tutor] encoding question

2014-01-05 Thread Steven D'Aprano

On Sat, Jan 04, 2014 at 11:57:20PM -0800, Alex Kleider wrote:

 Well, I've tried the xml approach which seems promising but still I get 
 an encoding related error.
 Is there a bug in the xml.etree module (not very likely, me thinks) or 
 am I doing something wrong?

I'm no expert on XML, but it looks to me like it is a bug in 
ElementTree. It doesn't appear to handle unicode strings correctly 
(although perhaps it doesn't promise to).

A simple demonstration using Python 2.7:

py> import xml.etree.ElementTree as ET
py> ET.fromstring(u'<xml>a</xml>')
<Element 'xml' at 0xb7ca982c>

But:

py> ET.fromstring(u'<xml>á</xml>')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
    parser.feed(text)
  File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1622, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in 
position 5: ordinal not in range(128)

An easy work-around:

py> ET.fromstring(u'<xml>á</xml>'.encode('utf-8'))
<Element 'xml' at 0xb7ca9a8c>

although, as I said, I'm no expert on XML and this may lead to errors 
later on.


 There's no denying that the whole encoding issue is still not completely 
 clear to me in spite of having devoted a lot of time to trying to grasp 
 all that's involved.

Have you read Joel On Software's explanation?

http://www.joelonsoftware.com/articles/Unicode.html

It's well worth reading. Start with that, and then ask if you have any 
further questions.


 Here's what I've got:
 
 alex@x301:~/Python/Parse$ cat ip_xml.py
 #!/usr/bin/env python
 # -*- coding : utf -8 -*-
 # file: 'ip_xml.py'
[...]
 tree = ET.fromstring(xml)
 root = tree.getroot()   # Here's where it blows up!!!

I reckon that what you need is to change the first line to:

tree = ET.fromstring(xml.encode('latin-1'))

or whatever the encoding is meant to be.


-- 
Steven

Re: [Tutor] encoding question

2014-01-05 Thread eryksun

On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider aklei...@sonic.net wrote:
 def ip_info(ip_address):

     response =  urllib2.urlopen(url_format_str %\
                                    (ip_address, ))
     encoding = response.headers.getparam('charset')
     print "'encoding' is '%s'." % (encoding, )
     info = unicode(response.read().decode(encoding))

decode() already returns a unicode object, so the unicode() call is redundant.

     n = info.find('\n')
     print "location of first newline is %s." % (n, )
     xml = info[n+1:]
     print "'xml' is '%s'." % (xml, )

     tree = ET.fromstring(xml)
     root = tree.getroot()   # Here's where it blows up!!!
     print "'root' is '%s', with the following children:" % (root, )
     for child in root:
         print child.tag, child.attrib
     print "END of CHILDREN"
     return info

Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:

<?xml version="1.0" encoding="ISO-8859-1" ?>

Leave it to ElementTree. Here's something to get you started:

import urllib2
import xml.etree.ElementTree as ET
import collections

url_format_str = 'http://api.hostip.info/?ip=%s&position=true'
GML = 'http://www.opengis.net/gml'
IPInfo = collections.namedtuple('IPInfo', '''
    ip
    city
    country
    latitude
    longitude
''')

def ip_info(ip_address):
    response = urllib2.urlopen(url_format_str %
                               ip_address)
    tree = ET.fromstring(response.read())
    hostip = tree.find('{%s}featureMember/Hostip' % GML)
    ip = hostip.find('ip').text
    city = hostip.find('{%s}name' % GML).text
    country = hostip.find('countryName').text
    coord = hostip.find('.//{%s}coordinates' % GML).text
    lon, lat = coord.split(',')
    return IPInfo(ip, city, country, lat, lon)


>>> info = ip_info('201.234.178.62')
>>> info.ip
'201.234.178.62'
>>> info.city, info.country
(u'Bogot\xe1', 'COLOMBIA')
>>> info.latitude, info.longitude
('10.4', '-75.2833')

This assumes everything works perfect. You have to decide how to fail
gracefully for the service being unavailable or malformed XML
(incomplete or corrupted response, etc).

Re: [Tutor] encoding question

2014-01-05 Thread Alex Kleider


On 2014-01-05 08:02, eryksun wrote:
On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider aklei...@sonic.net 
wrote:

[...]


Thanks again for the input.
You're using some ET syntax there that would probably make my code much 
more readable but will require a bit more study on my part.


I was up all night trying to get this sorted out and was finally 
successful.

(Re-) Reading 'joelonsoftware' and some of the Python docs helped.
Here's what I came up with (still needs modification to return a 
dictionary, but that'll be trivial.)


alex@x301:~/Python/Parse$ cat ip_xml.py
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# -*- coding : utf-8 -*-
# file: 'ip_xml.py'

import urllib2
import xml.etree.ElementTree as ET


url_format_str = \
    u'http://api.hostip.info/?ip=%s&position=true'

def ip_info(ip_address):
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding)
    # info comes in as type 'unicode'.
    n = info.find('\n')
    xml = info[n+1:]  # Get rid of a header line.
    # root = ET.fromstring(xml) # This causes error:
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    # in position 456: ordinal not in range(128)
    root = ET.fromstring(xml.encode("utf-8"))
    # This is the part I still don't fully understand but would
    # probably have to look at the library source to do so.
    info = []
    for i in range(4):
        info.append(root[3][0][i].text)
    info.append(root[3][0][4][0][0][0].text)

    return info

if __name__ == "__main__":
    info = ip_info("201.234.178.62")
    print info
    print info[1]

alex@x301:~/Python/Parse$ ./ip_xml.py
['201.234.178.62', u'Bogot\xe1', 'COLOMBIA', 'CO', '-75.2833,10.4']
Bogotá

Thanks to all who helped.
ak

Re: [Tutor] encoding question

2014-01-05 Thread Steven D'Aprano

On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:

 Danny walked you through the XML. Note that he didn't decode the
 response. It includes an encoding on the first line:
 
 <?xml version="1.0" encoding="ISO-8859-1" ?>

That surprises me. I thought XML was only valid in UTF-8? Or maybe that 
was wishful thinking.

 tree = ET.fromstring(response.read())

In other words, leave it to ElementTree to manage the decoding and 
encoding itself. Nice -- I like that solution.
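A minimal Python 3 sketch of that idea, assuming the response body is raw bytes
that carry their own XML declaration. The raw value below is a hand-made
stand-in for response.read(), not the real hostip.info output.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.read(): Latin-1 encoded bytes, with
# the encoding named in the XML declaration.
document = '<?xml version="1.0" encoding="ISO-8859-1" ?><city>Bogotá</city>'
raw = document.encode("latin-1")

# Hand the *bytes* to the parser; it reads the declaration and decodes.
root = ET.fromstring(raw)
print(root.text)
```

Note that nothing here calls decode() by hand; the bytes go to the parser
untouched.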



-- 
Steven

Re: [Tutor] encoding question

2014-01-05 Thread Alex Kleider


On 2014-01-05 14:26, Steven D'Aprano wrote:

On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:


Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:

<?xml version="1.0" encoding="ISO-8859-1" ?>


That surprises me. I thought XML was only valid in UTF-8? Or maybe that
was wishful thinking.


tree = ET.fromstring(response.read())


I believe you were correct the first time.
My experience with all of this has been that in spite of the xml having 
been advertised as having been encoded in ISO-8859-1 (which I believe is 
synonymous with Latin-1), my script (specifically Python's xml parser: 
xml.etree.ElementTree) didn't work until the xml was decoded from 
Latin-1 (into Unicode) and then encoded into UTF-8. Here's the snippet 
with some comments mentioning the painful lessons learned:


response =  urllib2.urlopen(url_format_str %\
                               (ip_address, ))
encoding = response.headers.getparam('charset')
info = response.read().decode(encoding)
# info comes in as type 'unicode'.
n = info.find('\n')
xml = info[n+1:]  # Get rid of a header line.
# root = ET.fromstring(xml) # This causes error:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
# in position 456: ordinal not in range(128)
root = ET.fromstring(xml.encode("utf-8"))





In other words, leave it to ElementTree to manage the decoding and
encoding itself. Nice -- I like that solution.


Re: [Tutor] encoding question

2014-01-05 Thread eryksun

On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano st...@pearwood.info wrote:
 On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:

 <?xml version="1.0" encoding="ISO-8859-1" ?>

 That surprises me. I thought XML was only valid in UTF-8? Or maybe that
 was wishful thinking.

JSON text SHALL be encoded in Unicode:

https://tools.ietf.org/html/rfc4627#section-3

For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the
MIME charset takes precedence. Section 8 has examples:

https://tools.ietf.org/html/rfc3023#section-8

So I was technically wrong to rely on the XML encoding (they happen to
be the same in this case). Instead you can create a parser with the
encoding from the header:

encoding = response.headers.getparam('charset')
parser = ET.XMLParser(encoding=encoding)
tree = ET.parse(response, parser)

The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1
and Unicode transport encodings. So it's probably better to transcode
to UTF-8 as Alex is doing, but then use a custom parser to override
the XML encoding:

encoding = response.headers.getparam('charset')
info = response.read().decode(encoding).encode('utf-8')

parser = ET.XMLParser(encoding='utf-8')
tree = ET.fromstring(info, parser)

Re: [Tutor] encoding question

2014-01-04 Thread eryksun

On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider aklei...@sonic.net wrote:
 The output I get on an Ubuntu 12.4LTS system is as follows:
 alex@x301:~/Python/Parse$ ./IP_info.py3
 Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833


 I would have thought that utf-8 could handle the 'a-acute'.

b'\xe1' is Latin-1. Look in the response headers:

url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'

>>> response = urllib.request.urlopen(url)
>>> response.headers.get_charsets()
['iso-8859-1']

>>> encoding = response.headers.get_charsets()[0]
>>> sp = response.read().decode(encoding).splitlines()
>>> sp[1]
'City: Bogotá'

Re: [Tutor] encoding question


On 2014-01-04 12:01, eryksun wrote:
On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider aklei...@sonic.net 
wrote:

[...]


Thank you very much.  Now things are more clear.
cheers,
alex

Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
 Any suggestions as to a better way to handle the problem of encoding in 
 the following context would be appreciated.

Python gives you lots of useful information when errors occur, but 
unfortunately your code throws that information away and replaces it 
with a totally useless message:

     try:
         country = item[9:].decode('utf-8')
     except:
         print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error 
message did it have? Why did it happen? Nobody knows, because you throw 
it away.

Never, never, never do this. If you don't understand an exception, you 
have no business covering it up and hiding that it took place. Never use 
a bare try...except, always catch the *smallest* number of specific 
exception types that make sense. Better is to avoid catching exceptions 
at all: an exception (usually) means something has gone wrong. You 
should aim to fix the problem *before* it blows up, not after.
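If a decode failure really has to be caught somewhere, a sketch like the
following (Python 3) keeps the information rather than discarding it.
decode_or_report is a hypothetical helper name; the attributes it reads are the
standard ones on UnicodeDecodeError.

```python
def decode_or_report(data, encoding="utf-8"):
    """Decode bytes; on failure, say exactly what went wrong and re-raise."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Catch only the one exception type we expect, and keep its details.
        bad = exc.object[exc.start:exc.end]
        print("cannot decode %r as %s: %s (bytes %r at position %d)"
              % (data, exc.encoding, exc.reason, bad, exc.start))
        raise   # do not pretend the problem went away


print(decode_or_report(b"Bogot\xc3\xa1"))   # valid UTF-8, decodes fine
```

Calling it with b'Bogot\xe1' prints which byte is at fault and where, and then
still fails loudly instead of carrying on with wrong data.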

I'm reminded of a quote:

I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare. -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash, 
it just covers up the fact that an exception occurred.


 The output I get on an Ubuntu 12.4LTS system is as follows:
 alex@x301:~/Python/Parse$ ./IP_info.py3
 Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833
 
 
 I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

py> 'Bogotá'.encode('utf-8')
b'Bogot\xc3\xa1'

py> b'Bogot\xc3\xa1'.decode('utf-8')
'Bogotá'


But you don't have UTF-8. You have something else, and trying to decode 
it using UTF-8 fails.

py> b'Bogot\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: 
unexpected end of data


More to follow...




-- 
Steven

Re: [Tutor] encoding question


On 2014-01-04 15:52, Steven D'Aprano wrote:


[...]


Of course it can:

py> 'Bogotá'.encode('utf-8')


I'm interested in knowing how you were able to enter the above line 
(assuming you have a key board similar to mine.)




[...]



I very much agree with your remarks.
In a pathetic attempt at self defence I just want to mention that what I 
presented wasn't what I thought was a solution.
Rather it was an attempt to figure out what the problem was as a 
preliminary step to fixing it.

With help from you and others, I was successful in doing this.
And for that help, I thank all list participants very much.


Re: [Tutor] encoding question

2014-01-04 Thread eryksun

On Sat, Jan 4, 2014 at 7:15 PM, Alex Kleider aklei...@sonic.net wrote:

 py> 'Bogotá'.encode('utf-8')

 I'm interested in knowing how you were able to enter the above line
 (assuming you have a key board similar to mine.)

I use an international keyboard layout:

https://en.wikipedia.org/wiki/QWERTY#US-International

One could also copy and paste from a printed literal:

>>> 'Bogot\xe1'
'Bogotá'

Or more verbosely:

>>> 'Bogot\N{latin small letter a with acute}'
'Bogotá'

Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
 Any suggestions as to a better way to handle the problem of encoding in 
 the following context would be appreciated.  The problem arose because 
 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from 
the webpage's headers. That will help 9 times out of 10. But 
unfortunately sometimes webpages will lack an encoding header, or they 
will lie, or the text will be invalid for that encoding. What to do 
then?

Let's start by factoring out the repeated code in your giant for-loop 
into something more manageable and maintainable:

 sp = response.splitlines()
 country = city = lat = lon = ip = ''
 for item in sp:
     if item.startswith(b"Country:"):
         try:
             country = item[9:].decode('utf-8')
         except:
             print("Exception raised.")
             country = item[9:]
     elif item.startswith(b"City:"):
         try:
             city = item[6:].decode('utf-8')
         except:
             print("Exception raised.")
             city = item[6:]

and so on, becomes:

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
country = city = lat = lon = ip = ''
for item in sp:
    if not item.strip():
        continue  # the response contains a blank line; skip it
    key, value = item.split(b':', 1)  # item is bytes, so split on bytes
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key == 'Country':
        country = value
    elif key == 'City':
        city = value
    elif key == 'Latitude':
        lat = value
    elif key == 'Longitude':
        lon = value
    elif key == 'IP':
        ip = value
    else:
        raise ValueError('unknown key "%s" found' % key)
return {"Country" : country,
        "City" : city,
        "Lat" : lat,
        "Long" : lon,
        "IP" : ip
        }


But we can do better than that!

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
record = {"Country": None, "City": None, "Latitude": None,
          "Longitude": None, "IP": None}
for item in sp:
    if not item.strip():
        continue  # the response contains a blank line; skip it
    key, value = item.split(b':', 1)  # item is bytes, so split on bytes
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key in record:
        record[key] = value
    else:
        raise ValueError('unknown key "%s" found' % key)
if None in list(record.values()):
    # Find the first key that was never filled in, to report it.
    for key, value in record.items():
        if value is None: break
    raise ValueError('missing key in record: %s' % key)
return record


This simplifies the code a lot, and adds some error-handling. It may be 
appropriate for your application to handle missing keys by using some 
default value, such as an empty string, or some other value that cannot 
be mistaken for an actual value, say "*missing*". But since I don't know 
your application's needs, I'm going to leave that up to you. Better to 
start strict and loosen up later, than start too loose and never realise 
that errors are occurring.

I've also changed the keys "Lat" and "Lon" to "Latitude" and 
"Longitude". If that's a problem, it's easy to fix. Just before 
returning the record, change the key:

record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing 
with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and 
hope it is the right one. UTF-8 is a good choice. (It's the only 
*correct* choice, everybody should be using UTF-8, but alas they often 
don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header 
is missing, and you should be good.
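
As a rough sketch of that fallback, assuming Python 3: the standard 
library's email.message.Message can pull the charset parameter out of a 
Content-Type value. The helper name charset_or_default is hypothetical, 
and this is not Eryksun's snippet, only one way the same idea could be 
written.

```python
from email.message import Message


def charset_or_default(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, else `default`."""
    if not content_type:
        return default  # header absent altogether
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter exists.
    return msg.get_content_charset() or default
```

For example, charset_or_default("text/html") falls back to the default, 
while a header that names a charset returns that name in lower case.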

How to deal with incorrect encodings? That can happen when the website 
creator *thinks* they are using a certain encoding, but somehow invalid 
bytes for that encoding creep into the data. That gives us a few 
different strategies:

(1) The third-party chardet module can analyse text and try to guess 
what encoding it *actually* is, rather than what encoding it claims to 
be. This is what Firefox and other web browsers do, because there are an 
awful lot of shitty websites out there. But it's not foolproof, so even 
if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch 
the exception and try again with a different encoding:

for codec in (encoding, 'utf-8', 'latin-1'):
    try:
        key = key.decode(codec)
    except UnicodeDecodeError:
        pass
    else:
        break

Latin-1 should be last, because it has the nice property that it will 
*always* succeed. That doesn't mean it will give you the right 
characters, as intended by the person who wrote the website, just that 
it will always give you *some* characters. They may be completely wrong, 
in other

Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

On Sat, Jan 04, 2014 at 04:15:30PM -0800, Alex Kleider wrote:

> py> 'Bogotá'.encode('utf-8')
>
> I'm interested in knowing how you were able to enter the above line 
> (assuming you have a keyboard similar to mine.)

I'm running Linux, and I use the KDE or Gnome character selector, 
depending on which computer I'm using. They give you a graphical window 
showing a screenful of characters at a time. Depending on the 
application, you can search for characters by name or property, then 
copy them to the clipboard and paste them into another application.

I can also use the compose key. My keyboard doesn't have an actual key 
labelled compose, but my system is set to use the right-hand Windows key 
(between Alt and the menu key) as the compose key. (Why the left-hand 
Windows key isn't set to do the same thing is a mystery to me.) So if I 
type:

Compose 'a

I get á.

The problem with the compose key is that it's not terribly intuitive. 
Sure, a few of them are: Compose 1 2 gives ½ but how do I get π (pi)? 
Compose p doesn't work.



-- 
Steven
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] encoding question

A heartfelt thank you to those of you who have given me much to ponder 
with your helpful responses.
In the meantime I've rewritten my procedure using a different approach 
altogether.  I'd be interested in knowing whether you think it's worth 
keeping, or whether you suggest I use your revisions to my original hack.


I've been maintaining both a Python3 and a Python2.7 version.  The 
latter has actually opened my eyes to more complexities. Specifically 
the need to use unicode strings rather than Python2.7's default ascii.


Here it is:
alex@x301:~/Python/Parse$ cat ip_info.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-

import re
import urllib2

url_format_str = \
    u'http://api.hostip.info/get_html.php?ip=%s&position=true'

info_exp = r"""
Country:[ ](?P<country>.*)
[\n]
City:[ ](?P<city>.*)
[\n]
[\n]
Latitude:[ ](?P<lat>.*)
[\n]
Longitude:[ ](?P<lon>.*)
[\n]
IP:[ ](?P<ip>.*)
"""
info_pattern = re.compile(info_exp, re.VERBOSE).search

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')

    info = info_pattern(response.read().decode(encoding))
    return {"Country" : unicode(info.group("country")),
            "City" : unicode(info.group("city")),
            "Lat" : unicode(info.group("lat")),
            "Lon" : unicode(info.group("lon")),
            "IP" : unicode(info.group("ip"))}

if __name__ == "__main__":
    print """    IP address is %(IP)s:
        Country: %(Country)s;  City: %(City)s.
        Lat/Long: %(Lat)s/%(Lon)s""" % ip_info("201.234.178.62")
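
For anyone who wants to try the pattern without hitting the web site, 
here is a small offline sketch. It assumes Python 3 and uses the sample 
response quoted in the docstring above as its only input; the names 
INFO_EXP, info_search and sample are illustrative.

```python
import re

# Same verbose pattern as in ip_info.py: [ ] is a literal space and
# [\n] a literal newline, because re.VERBOSE ignores bare whitespace.
INFO_EXP = r"""
    Country:[ ](?P<country>.*)
    [\n]
    City:[ ](?P<city>.*)
    [\n]
    [\n]
    Latitude:[ ](?P<lat>.*)
    [\n]
    Longitude:[ ](?P<lon>.*)
    [\n]
    IP:[ ](?P<ip>.*)
    """
info_search = re.compile(INFO_EXP, re.VERBOSE).search

# Sample text taken from the docstring of ip_info().
sample = ("Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\n"
          "Latitude: 38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n")

match = info_search(sample)
record = match.groupdict() if match else None
```

Note that info_search returns None if the layout of the response ever 
changes, so real code would need to check for that before calling 
group().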

Apart from soliciting your general comments, I'm also interested to know 
exactly what the line

# -*- coding : utf -8 -*-
really indicates or, more importantly, whether it is true, since I am 
using vim and I assume things are encoded as ASCII.


I've discovered that with Ubuntu it's very easy to switch from English 
(US) to English (US, international with dead keys) with just two clicks 
so thanks for that tip as well.





Re: [Tutor] encoding question

Hi Alex,


According to:

http://www.hostip.info/use.html

there is a JSON-based interface.  I'd recommend using that one!  JSON
is a format that's easy for machines to decode.  The format you're
parsing is primarily for humans, and who knows if that will change in
the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself
should be fairly straightforward.  For example:

#
## In Python 2.7
##
>>> import json
>>> import urllib
>>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
>>> info = json.load(response)
>>> info
{u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
u'country_code': u'US', u'ip': u'216.239.45.81'}
#
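
The same idea can be tried offline. The sketch below assumes Python 3 
and a payload rebuilt by hand from the output shown above (the exact 
bytes the service sends are an assumption), and the helper name 
summarize is made up for illustration.

```python
import json

# Hand-built stand-in for the bytes returned by get_json.php.
payload = (b'{"country_name": "UNITED STATES", "country_code": "US", '
           b'"city": "Mountain View, CA", "ip": "216.239.45.81"}')


def summarize(raw, encoding="utf-8"):
    """Decode a JSON payload and pick out the fields of interest."""
    info = json.loads(raw.decode(encoding))
    # .get() returns None instead of raising if the service omits a field.
    return {key: info.get(key)
            for key in ("country_name", "city", "ip", "lat", "lng")}


summary = summarize(payload)
```

Fields the service does not send, such as lat and lng in this payload, 
simply come back as None.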


Best of wishes!

Re: [Tutor] encoding question

You were asking earlier about the line:

# -*- coding : utf -8 -*-

See PEP 263:

http://www.python.org/dev/peps/pep-0263/
http://docs.python.org/release/2.3/whatsnew/section-encodings.html

It's a line that tells Python how to interpret the bytes of your
source program.  It allows us to write unicode literal strings
embedded directly in the program source itself.
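
PEP 263 describes the declaration in terms of a regular expression, so 
one way to see whether a given line would be recognised is to test it 
against that pattern. This is only a sketch in Python 3: the helper 
name declared_encoding is hypothetical, and it checks the pattern 
alone, not the other rule that the comment must sit on the first or 
second line of the file.

```python
import re

# Pattern given in PEP 263 for the encoding declaration.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None if no match."""
    match = CODING_RE.search(line)
    return match.group(1) if match else None


conventional = declared_encoding("# -*- coding: utf-8 -*-")
spaced = declared_encoding("# -*- coding : utf -8 -*-")
```

With this pattern the conventional spelling yields 'utf-8', while the 
spaced-out form from the question does not match at all, because the 
colon has to follow the word "coding" directly.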

Re: [Tutor] encoding question

2014-01-04 Thread Dave Angel

On Sat, 04 Jan 2014 18:31:13 -0800, Alex Kleider aklei...@sonic.net 
wrote:

> exactly what the line
> # -*- coding : utf -8 -*-
> really indicates or more importantly, is it true, since I am using
> vim and I assume things are encoded as ascii?


I don't know vim specifically, but I'm 99% sure it will let you 
specify the encoding. Certainly emacs does, so I'd not expect vim to 
fall behind on such a fundamental point. It's also likely that it 
defaults to UTF-8 for new files. Either way, your job is to make sure 
that the encoding line matches what the editor is using. Emacs also 
looks in the first few lines for that same encoding line, so if you 
format it carefully, it'll just work. It's easy to test for yourself: 
just paste some international characters into a literal string.


--
DaveA


Re: [Tutor] encoding question


On 2014-01-04 18:44, Danny Yoo wrote:

> Hi Alex,
>
> According to:
>
> http://www.hostip.info/use.html
>
> there is a JSON-based interface.  I'd recommend using that one!  JSON
> is a format that's easy for machines to decode.  The format you're
> parsing is primarily for humans, and who knows if that will change in
> the future to make it easier to read?
>
> Not only is JSON probably more reliable to parse, but the code itself
> should be fairly straightforward.  For example:
>
> #
> ## In Python 2.7
> ##
> >>> import json
> >>> import urllib
> >>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
> >>> info = json.load(response)
> >>> info
> {u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
> u'country_code': u'US', u'ip': u'216.239.45.81'}
> #




This strikes me as being the most elegant solution to date, and I thank 
you for it!


The problem is that the city name doesn't come in:

alex@x301:~/Python/Parse$ cat tutor.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-
# file: 'tutor.py'
"""
Put your docstring here.
"""
print "Running 'tutor.py'..."

import json
import urllib
response = urllib.urlopen\
 ('http://api.hostip.info/get_json.php?ip=201.234.178.62&position=true')
info = json.load(response)
print info

alex@x301:~/Python/Parse$ ./tutor.py
Running 'tutor.py'...
{u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', 
u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': 
u'-75.2833'}


If I use my own IP the city comes in fine, so there must still be some 
problem with the encoding.

Should I be using
    encoding = response.headers.getparam('charset')
in there somewhere?



Any ideas?

Re: [Tutor] encoding question

2014-01-04 Thread eryksun

On Sat, Jan 4, 2014 at 11:16 PM, Alex Kleider aklei...@sonic.net wrote:
> {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', u'country_code':
> u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'}
>
> If I use my own IP the city comes in fine so there must still be some
> problem with the encoding.

Report a bug in their JSON API. It's returning b'"city":null'. I see
the same problem for www.msj.go.cr in San José, Costa Rica. It's
probably broken for all non-ASCII byte strings.
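
A quick way to convince yourself that nothing is going wrong on the 
client side is to feed the json module a payload containing a null. 
The sketch below assumes Python 3, and the payload is a hand-written 
stand-in for what the server sends, not a captured response.

```python
import json

# Stand-in for the server's reply: the city value is a JSON null.
payload = b'{"city":null,"country_name":"COLOMBIA","country_code":"CO"}'

info = json.loads(payload.decode("utf-8"))
city = info["city"]  # JSON null is mapped to Python's None
```

The None is a faithful translation of what was sent, so choosing a 
different encoding when decoding cannot bring the city name back.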

Re: [Tutor] encoding question

Oh! That's unfortunate! That looks like a bug on the hostip.info
side. Check with them about it.

I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.

[... XML rant about to start. I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach. But I truly dislike XML for being implemented in ways that
are usually not fun to navigate: either the APIs or the encoded data
are usually convoluted enough to make it a chore rather than a
pleasure.

The beginning does look similar:

##
>>> import xml.etree.ElementTree as ET
>>> import urllib
>>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
>>> tree = ET.parse(response)
>>> tree
<xml.etree.ElementTree.ElementTree object at 0x185a2d0>
##

Up to this point, not so bad. But this is where it starts to look silly:

##
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
'201.234.178.62'
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
u'Bogot\xe1'
##

where we need to deal with XML namespaces, an extra complexity for a
benefit that I have never bought into.

More than that, usually the XML I run into in practice isn't even
properly structured, as is the case with the lat-long value in the XML
output here:

##
>>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
'-75.2833,10.4'
##

which is truly silly. Why are the latitude and longitude not two
separate, structured values? What is this XML buying us here, really
then? I'm convinced that all the extraneous structure and complexity
in XML causes the people who work with it to stop caring, the result
being something that isn't for the benefit of either humans or
computer programs.
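
For what it's worth, the lookups above get less noisy if the namespace 
is given a prefix once and passed to find(). The sketch below assumes 
Python 3. The XML is a hypothetical minimal document shaped to fit the 
paths used above, not the service's real output, and the names NS and 
SAMPLE are illustrative.

```python
import xml.etree.ElementTree as ET

# Map a short prefix to the namespace URI used in the paths above.
NS = {"gml": "http://www.opengis.net/gml"}

# Hypothetical document, just detailed enough to exercise the lookups.
SAMPLE = """\
<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <Hostip>
      <ip>201.234.178.62</ip>
      <gml:name>Bogotá</gml:name>
      <ipLocation>
        <gml:pointProperty>
          <gml:Point>
            <gml:coordinates>-75.2833,10.4</gml:coordinates>
          </gml:Point>
        </gml:pointProperty>
      </ipLocation>
    </Hostip>
  </gml:featureMember>
</HostipLookupResultSet>
""".encode("utf-8")

root = ET.fromstring(SAMPLE)
hostip = root.find("gml:featureMember/Hostip", NS)
ip = hostip.find("ip").text
city = hostip.find("gml:name", NS).text

# The coordinates arrive as one "longitude,latitude" string.
lon_text, lat_text = root.find(".//gml:coordinates", NS).text.split(",")
lon, lat = float(lon_text), float(lat_text)
```

The order longitude-then-latitude is inferred from the JSON output 
earlier in the thread, where lat is 10.4 and lng is -75.2833.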

Hence, that's why I prefer JSON: JSON export is usually a lot more
sensible, for reasons that I can speculate on, but I probably should
stop this rant. :P

Re: [Tutor] encoding question

> then?  I'm convinced that all the extraneous structure and complexity
> in XML causes the people who work with it to stop caring, the result
> being something that isn't for the benefit of either humans or
> computer programs.


... I'm sorry.  Sometimes I get grumpy when I haven't had a Snickers.

I should not have said the above here.  It isn't factual, and worse,
it insinuates an uncharitable intent to people who I do not know.
There's enough insinuation and insults out there in the world already:
I should not be contributing to those things.  For that, I apologize.

Re: [Tutor] encoding question

On 2014-01-04 21:20, Danny Yoo wrote:

> Oh! That's unfortunate! That looks like a bug on the hostip.info
> side. Check with them about it.
>
> I can't get the source code to whatever is implementing the JSON
> response, so I can not say why the city is not being properly included
> there.
>
> [... XML rant about to start. I am not disinterested, so my apologies
> in advance.]
>
> The beginning does look similar:
>
> ##
> >>> import xml.etree.ElementTree as ET
> >>> import urllib
> >>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
> >>> tree = ET.parse(response)
> >>> tree
> <xml.etree.ElementTree.ElementTree object at 0x185a2d0>
> ##
>
> Up to this point, not so bad. But this is where it starts to look
> silly:
>
> ##
> >>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
> '201.234.178.62'
> >>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
> u'Bogot\xe1'
> ##
>
> where we need to deal with XML namespaces, an extra complexity for a
> benefit that I have never bought into.
>
> More than that, usually the XML I run into in practice isn't even
> properly structured, as is the case with the lat-long value in the XML
> output here:
>
> ##
> >>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
> '-75.2833,10.4'
> ##
>
> Hence, that's why I prefer JSON: JSON export is usually a lot more
> sensible, for reasons that I can speculate on, but I probably should
> stop this rant. :P

Not a rant at all.

As it turns out, one of the other things that has interested me of late
is DocBook, an XML dialect (I think this is the correct way to express
it). I've found it very useful and so do not share your distaste for
XML, although one can't disagree with the points you've made with regard
to XML as a solution to the problem under discussion.
I've not played with the Python XML interfaces before, so this will be a
good project for me.

Thanks.

Re: [Tutor] encoding question