Re: [Tutor] encoding question


On 01/05/2014 08:57 AM, Alex Kleider wrote:

On 2014-01-04 21:20, Danny Yoo wrote:

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.


Well, I've tried the xml approach which seems promising but still I get an
encoding related error.


Note that the (computing) data description format (JSON, XML...) and the textual 
format, or "encoding" (Unicode utf8/16/32, legacy iso-8859-* also called 
latin-*, ...) are more or less unrelated and independent. Changing the data 
description format cannot solve a text encoding issue (but it may hide it, if by 
chance the new data description format happens to use the same text encoding you 
use when reading, implicitly or explicitly).


Denis
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor

Re: [Tutor] encoding question


On 01/05/2014 03:31 AM, Alex Kleider wrote:

I've been maintaining both a Python3 and a Python2.7 version.  The latter has
actually opened my eyes to more complexities. Specifically the need to use
unicode strings rather than Python2.7's default ascii.


So-called Unicode strings are not the solution to all problems. Take your 'á': 
it can be represented either by 1 "precomposed" code (Unicode code point) 0xe1, 
or by 2 codes (one for the "base" 'a', one for the "combining" '´'). Imagine you 
search for "Bogotá": how do you know which representation is used in the text 
you search? How do you know at all that there are multiple representations, and 
what they are? The routine will work iff, by chance, your *programming editor* 
(!) used the same representation as the software used to create the searched 
text...


Usually this is the case, because most text-creation software uses precomposed 
codes, when they exist, for composite characters. (But this fact just makes the 
issue rarer, harder to be aware of, and thus difficult to cope with correctly in 
code. As far as I know nearly no software does it.)
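
A minimal sketch of the two representations, in Python 3, and of one common way 
to compare them (normalizing both sides to the same form with the standard 
unicodedata module). The helper name same_text is only an illustration, not 
something from this thread:

```python
import unicodedata

# Two spellings of the same visible word. Escapes are used so that the
# editor's own choice of representation cannot interfere.
precomposed = "Bogot\u00e1"    # 'á' as the single code point U+00E1
decomposed = "Bogota\u0301"    # 'a' followed by U+0301 COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)            # naive comparison
print(same_text(precomposed, decomposed))   # comparison after normalization
```

The naive comparison treats the two strings as different because they hold 
different code points; normalizing first is one way to make a search independent 
of which representation the text happens to use.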


Denis

Re: [Tutor] encoding question


On 01/04/2014 08:26 PM, Alex Kleider wrote:

Any suggestions as to a better way to handle the problem of encoding in the
following context would be appreciated.  The problem arose because 'Bogota' is
spelt with an acute accent on the 'a'.

$ cat IP_info.py3
#!/usr/bin/env python3
# -*- coding : utf -8 -*-
# file: 'IP_info.py3'  a module.

import urllib.request

url_format_str = \
 'http://api.hostip.info/get_html.php?ip=%s&position=true'

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib.request.urlopen(url_format_str %\
                                   (ip_address, )).read()
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        if item.startswith(b"Country:"):
            try:
                country = item[9:].decode('utf-8')
            except:
                print("Exception raised.")
                country = item[9:]
        elif item.startswith(b"City:"):
            try:
                city = item[6:].decode('utf-8')
            except:
                print("Exception raised.")
                city = item[6:]
        elif item.startswith(b"Latitude:"):
            try:
                lat = item[10:].decode('utf-8')
            except:
                print("Exception raised.")
                lat = item[10]
        elif item.startswith(b"Longitude:"):
            try:
                lon = item[11:].decode('utf-8')
            except:
                print("Exception raised.")
                lon = item[11]
        elif item.startswith(b"IP:"):
            try:
                ip = item[4:].decode('utf-8')
            except:
                print("Exception raised.")
                ip = item[4:]
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip}

if __name__ == "__main__":
    addr =  "201.234.178.62"
    print ("""IP address is %(IP)s:
    Country: %(Country)s;  City: %(City)s.
    Lat/Long: %(Lat)s/%(Long)s""" % ip_info(addr))
"""

The output I get on an Ubuntu 12.4LTS system is as follows:
alex@x301:~/Python/Parse$ ./IP_info.py3
Exception raised.
 IP address is 201.234.178.62:
 Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
 Lat/Long: 10.4/-75.2833


I would have thought that utf-8 could handle the 'a-acute'.

Thanks,
alex


'á' does not encode to 0xe1 in UTF-8; what you are reading is probably (legacy) 
data in latin-1 (or another latin-* encoding).
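
A short Python 3 check of that statement. The only assumption is the word 
itself, written with an escape so the result does not depend on how this message 
is stored:

```python
text = "Bogot\u00e1"  # 'Bogotá'

latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # 'á' takes two bytes here

print(latin1_bytes)
print(utf8_bytes)

# The byte string seen in the failing output decodes cleanly as latin-1.
print(b"Bogot\xe1".decode("latin-1") == text)
```

So a lone 0xe1 byte is what latin-1 produces for 'á', which is consistent with 
the b'Bogot\xe1' shown in the output above.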


Denis

Re: [Tutor] encoding question


On 01/05/2014 12:52 AM, Steven D'Aprano wrote:

If you don't understand an exception, you
have no business covering it up and hiding that it took place. Never use
a bare try...except, always catch the *smallest* number of specific
exception types that make sense. Better is to avoid catching exceptions
at all: an exception (usually) means something has gone wrong. You
should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare." -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash,
it just covers up the fact that an exception occurred.


An exception, or any other kind of anomaly detected by a func one calls, is in 
most cases a *symptom* of an error, somewhere else in one's code (possibly far 
in source, possibly long earlier, possibly apparently unrelated). Catching an 
exception (except in rare cases), is just suppressing a _signal_ about a 
probable error. Catching an exception does not make the code correct, it just 
pretends to (except in rare cases). It's like hiding the dirt under a carpet, or 
beating up the poor guy that ran for 3 kilometers to tell you a fire is 
threatening your home.


Again: the anomaly (eg wrong input) detected by a func is not the error; it is a 
consequence of the true original error, what one should aim at correcting. (But 
our culture apparently loves repressing symptoms rather than curing actual 
problems: we programmers just often thoughtlessly apply the scheme ;-)


We should instead gratefully thank func authors for having correctly done their 
jobs of controlling input. They offer us the information needed to find bugs 
which otherwise may happily go on their lives undetected; and thus the 
opportunity to write more correct software. (This is why func authors should 
control input, refuse any anomalous or dubious values, and never ever try to 
guess what the app expects in such cases; instead just say "cannot do my job 
safely, or at all".)


If one is passing an empty set to an 'average' func, don't blame the func or 
shut up the signal/exception; instead be grateful to the func's author, and find 
out why and how the set comes to be empty. If one is trying to write into a 
file, don't blame the file for not existing, or the user for being stupid, or 
shut up the signal/exception; instead be grateful to the func's author, and find 
out why and how the file comes not to exist at this point (about the user: is 
your doc clear enough?).


The sub-category of cases where exception handling makes sense at all is the 
following:
* a called function may fail (eg average, find a given item in a list, write 
into a file)

* and, the failure case makes sense for the app, it _does_ belong to the app 
logic
* and, the case should nevertheless be handled like others up to this point in 
code (meaning, there should not be a separate branch for it, we should really 
land there in code even for this failure case)
* and, one cannot know whether it is a failure case without trying, or it would 
be as costly as just trying (wrong for average, right for 2 other examples)
* and, one can repair the failure right here, in any case, and go on correctly 
according to the app logic (depends on apps) (there is also the category of 
alternate running modes)


In such a situation, the right thing to do is to catch the exception signal (or 
use whatever error management exists, eg a check for a None return value) and 
proceed correctly (and think of testing this case ;-).
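
A minimal Python 3 sketch of such a legitimate case, under assumptions of my 
own: an app whose settings file is optional, so "file does not exist" belongs to 
the app logic and can be repaired on the spot. The names load_settings and 
DEFAULTS are hypothetical:

```python
import json

# Hypothetical defaults used when the optional settings file is absent.
DEFAULTS = {"timeout": 10}


def load_settings(path):
    """Return the settings stored at path, or the defaults if there is no file."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # Expected by the app logic: a missing file means "use the defaults".
        return dict(DEFAULTS)
    # Anything else (e.g. malformed JSON) is NOT caught: it signals a real
    # problem somewhere else and should stay visible.


print(load_settings("no-such-settings-file.json"))
```

Only the one exception type that the app logic expects is caught; every other 
failure still crashes, which is the point made above.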


But this is not that common. In particular, if the failure case does not belong 
to the app logic (the item should be there, the file should exist) then do *not* 
catch a potential signal: if it happens, it would tell you about a bug 
*elsewhere* in code; and _this_ is what is to correct.


There is a mythology in programming that software should not crash; wrongly 
understood (or rightly, authors of such texts usually are pretty unclear and 
ambiguous), this leads to catching exceptions that are just signals of symptoms 
of errors... Instead, software should crash whenever it is incorrect; often 
(when the error does not cause obvious misbehaviour) it is the only way for the 
programmer to know about errors. Crashes are the programmer's best friend (I 
mean, those programmers whose aim is to write quality software).


Denis

Re: [Tutor] encoding question


On 2014-01-04 21:20, Danny Yoo wrote:

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.


Well, I've tried the xml approach which seems promising but still I get 
an encoding related error.
Is there a bug in the xml.etree module (not very likely, me thinks) or 
am I doing something wrong?
There's no denying that the whole encoding issue is still not completely 
clear to me in spite of having devoted a lot of time to trying to grasp 
all that's involved.


Here's what I've got:

alex@x301:~/Python/Parse$ cat ip_xml.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-
# file: 'ip_xml.py'

import urllib2
import xml.etree.ElementTree as ET


url_format_str = \
u'http://api.hostip.info/?ip=%s&position=true'

def ip_info(ip_address):
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')
    print "'encoding' is '%s'." % (encoding, )
    info = unicode(response.read().decode(encoding))
    n = info.find('\n')
    print "location of first newline is %s." % (n, )
    xml = info[n+1:]
    print "'xml' is '%s'." % (xml, )

    tree = ET.fromstring(xml)
    root = tree.getroot()   # Here's where it blows up!!!
    print "'root' is '%s', with the following children:" % (root, )
    for child in root:
        print child.tag, child.attrib
    print "END of CHILDREN"
    return info

if __name__ == "__main__":
    info = ip_info("201.234.178.62")

alex@x301:~/Python/Parse$ ./ip_xml.py
'encoding' is 'iso-8859-1'.
location of first newline is 44.
'xml' is 'xmlns:gml="http://www.opengis.net/gml" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd">

 This is the Hostip Lookup Service
 hostip
 
  inapplicable
 
 
  
   201.234.178.62
   Bogotá
   COLOMBIA
   CO
   
   

 http://www.opengis.net/gml/srs/epsg.xml#4326";>
  -75.2833,10.4
 

   
  
 

'.
Traceback (most recent call last):
  File "./ip_xml.py", line 33, in <module>
    info = ip_info("201.234.178.62")
  File "./ip_xml.py", line 23, in ip_info
    tree = ET.fromstring(xml)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in 
position 456: ordinal not in range(128)
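
The traceback suggests the parser was handed an already-decoded unicode string 
and tried to turn it back into bytes with the ASCII codec. One way around that, 
sketched here in Python 3, is to give ElementTree the raw bytes and let it 
honour the encoding named in the XML declaration. The sample document is a 
made-up stand-in, not the real hostip.info response:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the server's reply: bytes in ISO-8859-1 with a
# matching XML declaration.
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<Hostip><name>Bogot\u00e1</name></Hostip>"
).encode("iso-8859-1")

# Pass the bytes straight to the parser; no manual decode step.
root = ET.fromstring(sample)

# fromstring() already returns the root Element, so there is no getroot() call.
print(root.tag)
print(root.find("name").text)
```

Note that ET.fromstring returns an Element, not an ElementTree, so the later 
tree.getroot() call in the script would fail too once the encoding error is out 
of the way.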





Re: [Tutor] encoding question

On 2014-01-04 21:20, Danny Yoo wrote:

Oh! That's unfortunate! That looks like a bug on the hostip.info
side. Check with them about it.

I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.

[... XML rant about to start. I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach. But I truly dislike XML for being implemented in ways that
are usually not fun to navigate: either the APIs or the encoded data
are usually convoluted enough to make it a chore rather than a
pleasure.

The beginning does look similar:

import xml.etree.ElementTree as ET
import urllib
response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")

tree = ET.parse(response)
tree

Up to this point, not so bad. But this is where it starts to look
silly:

tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text

'201.234.178.62'

tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text

u'Bogot\xe1'
##

where we need to deal with XML namespaces, an extra complexity for a
benefit that I have never bought into.

More than that, usually the XML I run into in practice isn't even
properly structured, as is the case with the lat-long value in the XML
output here:

tree.find('.//{http://www.opengis.net/gml}coordinates').text

'-75.2833,10.4'
##

which is truly silly. Why is the latitude and longitude not two
separate, structured values? What is this XML buying us here, really
then? I'm convinced that all the extraneous structure and complexity
in XML causes the people who work with it to stop caring, the result
being something that isn't for the benefit of either humans nor
computer programs.

Hence, that's why I prefer JSON: JSON export is usually a lot more
sensible, for reasons that I can speculate on, but I probably should
stop this rant. :P

Not a rant at all.

As it turns out, one of the other things that have interested me of late
is docbook, an xml dialect (I think this is the correct way to express
it.) I've found it very useful and so do not share your distaste for
xml although one can't disagree with the points you've made with regard
to xml as a solution to the problem under discussion.
I've not played with the python xml interfaces before so this will be a
good project for me.

Thanks.

Re: [Tutor] encoding question

> then?  I'm convinced that all the extraneous structure and complexity
> in XML causes the people who work with it to stop caring, the result
> being something that isn't for the benefit of either humans nor
> computer programs.


... I'm sorry.  Sometimes I get grumpy when I haven't had a Snickers.

I should not have said the above here.  It isn't factual, and worse,
it insinuates an uncharitable intent to people who I do not know.
There's enough insinuation and insults out there in the world already:
I should not be contributing to those things.  For that, I apologize.

Re: [Tutor] encoding question

Oh!  That's unfortunate!  That looks like a bug on the hostip.info
side.  Check with them about it.


I can't get the source code to whatever is implementing the JSON
response, so I can not say why the city is not being properly included
there.


[... XML rant about to start.  I am not disinterested, so my apologies
in advance.]

... In that case... I suppose trying the XML output is a possible
approach.  But I truly dislike XML for being implemented in ways that
are usually not fun to navigate: either the APIs or the encoded data
are usually convoluted enough to make it a chore rather than a
pleasure.

The beginning does look similar:

##
>>> import xml.etree.ElementTree as ET
>>> import urllib
>>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
>>> tree = ET.parse(response)
>>> tree

##


Up to this point, not so bad.  But this is where it starts to look silly:

##
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
'201.234.178.62'
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
u'Bogot\xe1'
##

where we need to deal with XML namespaces, an extra complexity for a
benefit that I have never bought into.
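
For what it's worth, the braces can be kept out of the paths by handing find() 
a prefix-to-URI mapping. A small Python 3 sketch, using a hand-written sample 
that only imitates the element paths shown above (the real response has more in 
it), and a prefix name chosen purely for illustration:

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI mapping; the prefix name is our own choice.
NS = {"gml": "http://www.opengis.net/gml"}

# Hypothetical sample shaped like the paths used in this thread.
SAMPLE = """\
<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember>
    <Hostip>
      <ip>201.234.178.62</ip>
      <gml:name>Bogot&#225;</gml:name>
    </Hostip>
  </gml:featureMember>
</HostipLookupResultSet>
"""

root = ET.fromstring(SAMPLE)
ip = root.find("gml:featureMember/Hostip/ip", NS).text
city = root.find("gml:featureMember/Hostip/gml:name", NS).text
print(ip, city)
```

The namespace is still there, but it is spelled out once instead of in every 
path.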


More than that, usually the XML I run into in practice isn't even
properly structured, as is the case with the lat-long value in the XML
output here:

##
>>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
'-75.2833,10.4'
##

which is truly silly.  Why is the latitude and longitude not two
separate, structured values?  What is this XML buying us here, really
then?  I'm convinced that all the extraneous structure and complexity
in XML causes the people who work with it to stop caring, the result
being something that isn't for the benefit of either humans or
computer programs.
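
Until the data comes structured, the pair has to be taken apart by hand. A 
small Python 3 sketch; the helper name is hypothetical, and the longitude-first 
order is an assumption inferred from the JSON output elsewhere in this thread 
(lat 10.4, lng -75.2833):

```python
def split_coordinates(text):
    """Split a 'longitude,latitude' string into (lat, lon) floats.

    Assumes longitude comes first, as the values in this thread suggest.
    """
    lon_str, lat_str = text.split(",")
    return float(lat_str), float(lon_str)


lat, lon = split_coordinates("-75.2833,10.4")
print(lat, lon)
```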


Hence, that's why I prefer JSON: JSON export is usually a lot more
sensible, for reasons that I can speculate on, but I probably should
stop this rant.  :P

Re: [Tutor] encoding question

2014-01-04 Thread eryksun

On Sat, Jan 4, 2014 at 11:16 PM, Alex Kleider  wrote:
> {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', u'country_code':
> u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'}
>
> If I use my own IP the city comes in fine so there must still be some
> problem with the encoding.

Report a bug in their JSON API. It's returning b'"city":null'. I see
the same problem for www.msj.go.cr in San José, Costa Rica. It's
probably broken for all non-ASCII byte strings.

Re: [Tutor] encoding question


On 2014-01-04 18:44, Danny Yoo wrote:

Hi Alex,


According to:

http://www.hostip.info/use.html

there is a JSON-based interface.  I'd recommend using that one!  JSON
is a format that's easy for machines to decode.  The format you're
parsing is primarily for humans, and who knows if that will change in
the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself
should be fairly straightforward.  For example:

#
## In Python 2.7
##

import json
import urllib
response = urllib.urlopen('http://api.hostip.info/get_json.php')
info = json.load(response)
info

{u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
u'country_code': u'US', u'ip': u'216.239.45.81'}
#




This strikes me as being the most elegant solution to date, and I thank 
you for it!


The problem is that the city name doesn't come in:

alex@x301:~/Python/Parse$ cat tutor.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-
# file: 'tutor.py'
"""
Put your docstring here.
"""
print "Running 'tutor.py'..."

import json
import urllib
response = urllib.urlopen\
 ('http://api.hostip.info/get_json.php?ip=201.234.178.62&position=true')
info = json.load(response)
print info

alex@x301:~/Python/Parse$ ./tutor.py
Running 'tutor.py'...
{u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', 
u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': 
u'-75.2833'}


If I use my own IP the city comes in fine so there must still be some 
problem with the encoding.

should I be using
encoding = response.headers.getparam('charset')
in there somewhere?



Any ideas?

Re: [Tutor] encoding question

2014-01-04 Thread Dave Angel

On Sat, 04 Jan 2014 18:31:13 -0800, Alex Kleider  
wrote:

exactly what the line
# -*- coding : utf -8 -*-
really indicates or more importantly, is it true, since I am using vim
and I assume things are encoded as ascii?


I don't know vim specifically, but I'm 99% sure it will let you 
specify the encoding. Certainly emacs does, so I'd not expect vim to 
fall behind on such a fundamental point. It's also likely 
that it defaults to UTF-8 for new files. In any case, your job is to make 
sure that the encoding line matches what the editor is using. Emacs 
also looks in the first few lines for that same encoding line, so if 
you format it carefully, it'll just work. It's easy to test 
for yourself: just paste some international characters into a literal 
string.


--
DaveA


Re: [Tutor] encoding question

You were asking earlier about the line:

# -*- coding : utf -8 -*-

See PEP 263:

http://www.python.org/dev/peps/pep-0263/
http://docs.python.org/release/2.3/whatsnew/section-encodings.html

It's a line that tells Python how to interpret the bytes of your
source program.  It allows us to write unicode literal strings
embedded directly in the program source itself.
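
One detail worth checking: the spelling of the line matters. The sketch below 
(Python 3) applies the pattern given in PEP 263, as best I recall it, to two 
spellings. If the line in the scripts really is written with a space before the 
colon and inside "utf -8", it would not be recognised as a declaration at all. 
The helper name declared_encoding is hypothetical:

```python
import re

# Pattern for the declaration comment, as given in PEP 263 (quoted from memory).
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named by a source line, or None if it names none."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


print(declared_encoding("# -*- coding: utf-8 -*-"))
print(declared_encoding("# -*- coding : utf -8 -*-"))
```

The first spelling yields utf-8; the second yields nothing, because the pattern 
requires the colon (or equals sign) to follow "coding" immediately.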

Re: [Tutor] encoding question

Hi Alex,


According to:

http://www.hostip.info/use.html

there is a JSON-based interface.  I'd recommend using that one!  JSON
is a format that's easy for machines to decode.  The format you're
parsing is primarily for humans, and who knows if that will change in
the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself
should be fairly straightforward.  For example:

#
## In Python 2.7
##
>>> import json
>>> import urllib
>>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
>>> info = json.load(response)
>>> info
{u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
u'country_code': u'US', u'ip': u'216.239.45.81'}
#


Best of wishes!

Re: [Tutor] encoding question

A heartfelt thank you to those of you that have given me much to ponder 
with your helpful responses.
In the mean time I've rewritten my procedure using a different approach 
all together.  I'd be interested in knowing if you think it's worth 
keeping or do you suggest I use your revisions to my original hack?


I've been maintaining both a Python3 and a Python2.7 version.  The 
latter has actually opened my eyes to more complexities. Specifically 
the need to use unicode strings rather than Python2.7's default ascii.


Here it is:
alex@x301:~/Python/Parse$ cat ip_info.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-

import re
import urllib2

url_format_str = \
u'http://api.hostip.info/get_html.php?ip=%s&position=true'

info_exp = r"""
Country:[ ](?P<country>.*)
[\n]
City:[ ](?P<city>.*)
[\n]
[\n]
Latitude:[ ](?P<lat>.*)
[\n]
Longitude:[ ](?P<lon>.*)
[\n]
IP:[ ](?P<ip>.*)
"""
info_pattern = re.compile(info_exp, re.VERBOSE).search

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Long and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response =  urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
    encoding = response.headers.getparam('charset')

    info = info_pattern(response.read().decode(encoding))
    return {"Country" : unicode(info.group("country")),
            "City" : unicode(info.group("city")),
            "Lat" : unicode(info.group("lat")),
            "Lon" : unicode(info.group("lon")),
            "IP" : unicode(info.group("ip"))}

if __name__ == "__main__":
    print """IP address is %(IP)s:
    Country: %(Country)s;  City: %(City)s.
    Lat/Long: %(Lat)s/%(Lon)s""" % ip_info("201.234.178.62")

Apart from soliciting your general comments, I'm also interested to know 
exactly what the line

# -*- coding : utf -8 -*-
really indicates or more importantly, is it true, since I am using vim 
and I assume things are encoded as ascii?


I've discovered that with Ubuntu it's very easy to switch from English 
(US) to English (US, international with dead keys) with just two clicks 
so thanks for that tip as well.





Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

On Sat, Jan 04, 2014 at 04:15:30PM -0800, Alex Kleider wrote:

> >py> 'Bogotá'.encode('utf-8')
> 
> I'm interested in knowing how you were able to enter the above line 
> (assuming you have a key board similar to mine.)

I'm running Linux, and I use the KDE or Gnome character selector, 
depending on which computer I'm using. They give you a graphical window 
showing a screenful of characters at a time, depending on which 
application I'm using you can search for characters by name or property, 
then copy them into the clipboard to paste them into another 
application.

I can also use the "compose" key. My keyboard doesn't have an actual key 
labelled compose, but my system is set to use the right-hand Windows key 
(between Alt and the menu key) as the compose key. (Why the left-hand 
Windows key isn't set to do the same thing is a mystery to me.) So if I 
type:

 <compose> ' a

I get á.

The problem with the compose key is that it's not terribly intuitive. 
Sure, a few of them are: <compose> 1 2 gives ½ but how do I get π (pi)? 
<compose> p doesn't work.

-- 
Steven

Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in 
> the following context would be appreciated.  The problem arose because 
> 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from 
the webpage's headers. That will help 9 times out of 10. But 
unfortunately sometimes webpages will lack an encoding header, or they 
will lie, or the text will be invalid for that encoding. What to do 
then?

Let's start by factoring out the repeated code in your giant for-loop 
into something more manageable and maintainable:

> sp = response.splitlines()
> country = city = lat = lon = ip = ''
> for item in sp:
>     if item.startswith(b"Country:"):
>         try:
>             country = item[9:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             country = item[9:]
>     elif item.startswith(b"City:"):
>         try:
>             city = item[6:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             city = item[6:]
and so on, becomes:

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
country = city = lat = lon = ip = ''
for item in sp:
    if not item.strip():
        continue  # skip the blank line in the response
    # item is a bytes object, so split on a bytes colon
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key == 'Country':
        country = value
    elif key == 'City':
        city = value
    elif key == 'Latitude':
        lat = value
    elif key == "Longitude":
        lon = value
    elif key == 'IP':
        ip = value
    else:
        raise ValueError('unknown key "%s" found' % key)
return {"Country" : country,
        "City" : city,
        "Lat" : lat,
        "Long" : lon,
        "IP" : ip
        }

But we can do better than that!

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
record = {"Country": None, "City": None, "Latitude": None,
          "Longitude": None, "IP": None}
for item in sp:
    if not item.strip():
        continue  # skip the blank line in the response
    # item is a bytes object, so split on a bytes colon
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key in record:
        record[key] = value
    else:
        raise ValueError('unknown key "%s" found' % key)
if None in list(record.values()):
    # find the first key that never received a value
    for key, value in record.items():
        if value is None: break
    raise ValueError('missing key in record: %s' % key)
return record

This simplifies the code a lot, and adds some error-handling. It may be 
appropriate for your application to handle missing keys by using some 
default value, such as an empty string, or some other value that cannot 
be mistaken for an actual value, say "*missing*". But since I don't know 
your application's needs, I'm going to leave that up to you. Better to 
start strict and loosen up later, than start too loose and never realise 
that errors are occurring.
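
Here is one way the strict version might look as a self-contained Python 3 
function, fed with the sample response quoted in the original docstring. The 
function name parse_record and the choice to decode the whole body before 
splitting are my own assumptions, not part of the code above:

```python
EXPECTED_KEYS = ("Country", "City", "Latitude", "Longitude", "IP")


def parse_record(raw, encoding):
    """Parse the 'Key: value' lines of a response body given as bytes."""
    record = dict.fromkeys(EXPECTED_KEYS)       # every value starts as None
    for line in raw.decode(encoding).splitlines():
        if not line.strip():
            continue                            # the body contains a blank line
        key, _, value = line.partition(":")
        key = key.strip()
        if key not in record:
            raise ValueError('unknown key "%s" found' % key)
        record[key] = value.strip()
    missing = [k for k, v in record.items() if v is None]
    if missing:
        raise ValueError("missing key(s) in record: %s" % ", ".join(missing))
    return record


sample = (b"Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\n"
          b"Latitude: 38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n")
print(parse_record(sample, "utf-8"))
```

Unknown and missing keys both raise, so a change in the server's output format 
shows up as an error instead of as silently empty fields.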

I've also changed the keys "Lat" and "Lon" to "Latitude" and 
"Longitude". If that's a problem, it's easy to fix. Just before 
returning the record, change the key:

record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing 
with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and 
hope it is the right one. UTF-8 is a good choice. (It's the only 
*correct* choice, everybody should be using UTF-8, but alas they often 
don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header 
is missing, and you should be good.
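
A sketch of that default in Python 3. There, the headers object returned by 
urllib is an email.message.Message subclass, whose get_content_charset() method 
returns None when no charset is declared. To keep the example self-contained it 
builds the headers by hand instead of fetching anything; decode_body is a 
hypothetical helper name:

```python
from email.message import Message


def decode_body(body, headers, default="utf-8"):
    """Decode body using the charset from headers, or default if none is given."""
    charset = headers.get_content_charset() or default
    return body.decode(charset)


# Hand-made headers standing in for response.headers.
with_charset = Message()
with_charset["Content-Type"] = "text/plain; charset=iso-8859-1"

without_charset = Message()
without_charset["Content-Type"] = "text/plain"

print(decode_body(b"City: Bogot\xe1", with_charset))
print(decode_body(b"City: Bogot\xc3\xa1", without_charset))
```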

How to deal with incorrect encodings? That can happen when the website 
creator *thinks* they are using a certain encoding, but somehow invalid 
bytes for that encoding creep into the data. That gives us a few 
different strategies:

(1) The third-party "chardet" module can analyse text and try to guess 
what encoding it *actually* is, rather than what encoding it claims to 
be. This is what Firefox and other web browsers do, because there are an 
awful lot of shitty websites out there. But it's not foolproof, so even 
if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch 
the exception and try again with a different encoding:

for codec in (encoding, 'utf-8', 'latin-1'):
    try:
        key = key.decode(codec)
    except UnicodeDecodeError:
        pass
    else:
        break

Latin-1 should be last, because it has the nice property that it will 
*always* succeed. That doesn't mean it will give you the right 
characters, as intended by the person who wrote the website, just that 
it will always give
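
Strategy (2) can be wrapped in a small helper. This is only a sketch in Python 
3: the name decode_with_fallback is hypothetical, and returning the codec that 
worked, as well as skipping codec names Python does not know (LookupError), are 
choices of mine beyond the loop above:

```python
def decode_with_fallback(data, declared=None):
    """Try the declared codec, then UTF-8, then Latin-1; return (text, codec)."""
    candidates = [c for c in (declared, "utf-8", "latin-1") if c]
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except (UnicodeDecodeError, LookupError):
            continue    # wrong or unknown codec: try the next candidate
    # Not reached in practice: Latin-1 maps every byte to a character.
    raise AssertionError("latin-1 should decode any byte string")


print(decode_with_fallback(b"Bogot\xc3\xa1", "ascii"))
print(decode_with_fallback(b"Bogot\xe1"))
```

Returning the codec name makes it visible when the declared encoding was not 
the one that finally worked, so the mismatch is not hidden.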

Re: [Tutor] encoding question

2014-01-04 Thread eryksun

On Sat, Jan 4, 2014 at 7:15 PM, Alex Kleider  wrote:
>>
>> py> 'Bogotá'.encode('utf-8')
>
> I'm interested in knowing how you were able to enter the above line
> (assuming you have a key board similar to mine.)

I use an international keyboard layout:

https://en.wikipedia.org/wiki/QWERTY#US-International

One could also copy and paste from a printed literal:

>>> 'Bogot\xe1'
'Bogotá'

Or more verbosely:

>>> 'Bogot\N{latin small letter a with acute}'
'Bogotá'

Re: [Tutor] encoding question


On 2014-01-04 15:52, Steven D'Aprano wrote:


Oh great. An exception was raised. What sort of exception? What error
message did it have? Why did it happen? Nobody knows, because you throw
it away.

Never, never, never do this. If you don't understand an exception, you
have no business covering it up and hiding that it took place. Never use
a bare try...except, always catch the *smallest* number of specific
exception types that make sense. Better is to avoid catching exceptions
at all: an exception (usually) means something has gone wrong. You
should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare." -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash,
it just covers up the fact that an exception occured.



The output I get on an Ubuntu 12.4LTS system is as follows:
alex@x301:~/Python/Parse$ ./IP_info.py3
Exception raised.
IP address is 201.234.178.62:
Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
Lat/Long: 10.4/-75.2833


I would have thought that utf-8 could handle the 'a-acute'.


Of course it can:

py> 'Bogotá'.encode('utf-8')


I'm interested in knowing how you were able to enter the above line 
(assuming you have a key board similar to mine.)




b'Bogot\xc3\xa1'

py> b'Bogot\xc3\xa1'.decode('utf-8')
'Bogotá'


But you don't have UTF-8. You have something else, and trying to decode
it using UTF-8 fails.

py> b'Bogot\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "", line 1, in 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5:
unexpected end of data


More to follow...



I very much agree with your remarks.
In a pathetic attempt at self-defence, I just want to mention that what I
presented wasn't what I thought was a solution.
Rather, it was an attempt to figure out what the problem was as a
preliminary step to fixing it.

With help from you and others, I was successful in doing this.
And for that help, I thank all list participants very much.


Re: [Tutor] encoding question

2014-01-04 Thread Steven D'Aprano

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in 
> the following context would be appreciated.

Python gives you lots of useful information when errors occur, but 
unfortunately your code throws that information away and replaces it 
with a totally useless message:

> try:
>     country = item[9:].decode('utf-8')
> except:
>     print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error 
message did it have? Why did it happen? Nobody knows, because you throw 
it away.

Never, never, never do this. If you don't understand an exception, you 
have no business covering it up and hiding that it took place. Never use 
a bare try...except, always catch the *smallest* number of specific 
exception types that make sense. Better is to avoid catching exceptions 
at all: an exception (usually) means something has gone wrong. You 
should aim to fix the problem *before* it blows up, not after.
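
A minimal sketch of what catching only the specific exception might look
like, assuming Python 3. The function name decode_field is hypothetical
and stands in for the decoding step in the original script; the point is
only that the error details are reported and the exception is allowed to
propagate rather than being swallowed.

```python
import sys


def decode_field(raw, encoding='utf-8'):
    """Decode raw bytes, reporting the specific failure instead of hiding it.

    Hypothetical helper, not taken from the original script.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Keep the details: which bytes, which codec, and why it failed.
        print("Could not decode {!r} as {}: {}".format(raw, encoding, err),
              file=sys.stderr)
        # Re-raise so the failure is still visible to the caller.
        raise


# Valid UTF-8 decodes normally.
city = decode_field(b'Bogot\xc3\xa1')
```

Any other kind of exception (a TypeError from passing a str, say) is not
caught here at all, so it surfaces with its full traceback.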

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is
preventing programs from crashing. ... More experienced programmers
realize that correct code is great, code that crashes could use
improvement, but incorrect code that doesn't crash is a horrible
nightmare." -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash, 
it just covers up the fact that an exception occurred.

> The output I get on an Ubuntu 12.4LTS system is as follows:
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
> IP address is 201.234.178.62:
> Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
> Lat/Long: 10.4/-75.2833
> 
> 
> I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

py> 'Bogotá'.encode('utf-8')
b'Bogot\xc3\xa1'

py> b'Bogot\xc3\xa1'.decode('utf-8')
'Bogotá'

But you don't have UTF-8. You have something else, and trying to decode 
it using UTF-8 fails.

py> b'Bogot\xe1'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: 
unexpected end of data
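
The "unexpected end of data" wording can be checked by hand. In UTF-8, a
byte whose top four bits are 1110 starts a three-byte sequence, and 0xe1
is such a byte, so the decoder expects two more bytes that never arrive.
The sketch below, assuming Python 3, verifies that and then tries
Latin-1 purely as a guess at the real encoding; nothing in the data
itself confirms that guess.

```python
raw = b'Bogot\xe1'

# Indexing bytes in Python 3 gives an int; position 5 is the 0xe1 byte.
lead = raw[5]

# 0xe1 is 0b11100001: the 1110 prefix marks a three-byte UTF-8 sequence.
is_three_byte_lead = (lead & 0xF0) == 0xE0

# Assumption for illustration: the bytes are really Latin-1.
guess = raw.decode('latin-1')

# Re-encoding the guessed text as UTF-8 gives the two-byte form of a-acute.
reencoded = guess.encode('utf-8')

print(is_three_byte_lead, reencoded)
```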

More to follow...

-- 
Steven

Re: [Tutor] encoding question