Re: umlauts

Diez B. Roggisch Sat, 17 Oct 2009 10:06:54 -0700

MRAB wrote:

Arian Kuschki wrote:
Hi all
this has been bugging me for a long time and I do not seem to be able to understand what to do. I always have problems when dealing with input text that contains umlauts. Consider the following:
In [1]: import urllib
In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
In [3]: xml = f.read()

In [4]: f.close()

In [5]: print xml
------> print(xml)
<?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0" tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0"><forecast_information><city data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6 data=""/><longitude_e6 data=""/><forecast_date data="2009-10-17"/><current_date_time data="2009-10-17 14:20:00 +0000"/><unit_system data="SI"/></forecast_information><current_conditions><condition data="Meistens bew�kt"/><temp_f data="43"/><temp_c data="6"/><humidity data="Feuchtigkeit: 87�%"/><icon data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit Windgeschwindigkeiten von 13 km/h"/></current_conditions><forecast_conditions><day_of_week data="Sa."/><low data="1"/><high data="7"/><icon data="/ig/images/weather/chance_of_rain.gif"/><condition data="Vereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week data="So."/><low data="-1"/><high data="8"/><icon data="/ig/images/weather/chance_of_snow.gif"/><condition data="Vereinzelt Schnee"/></forecast_conditions><forecast_conditions><day_of_week data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/images/weather/mostly_sunny.gif"/><condition data="Teils sonnig"/></forecast_conditions><forecast_conditions><day_of_week data="Di."/><low data="0"/><high data="8"/><icon data="/ig/images/weather/sunny.gif"/><condition data="Klar"/></forecast_conditions></weather></xml_api_reply>
As you can see, the umlauts in the XML are not displayed properly. When I want to process this text (for example with xml.sax), I get error messages because the parser can't read this.
I've tried to read up on this and there is a lot of information on the web, but nothing seems to work for me. For example, setting the coding to UTF-8 like this: # -*- coding: utf-8 -*- or using the decode() string method.
I always have this kind of problem when input contains umlauts, not just in this case. My locale (on Ubuntu) is en_GB.UTF-8.
The string you received from the website is a bytestring and you're just
printing it to your console, which is configured for UTF-8. However, the
bytestring isn't valid UTF-8, so the console is replacing the invalid
parts with the funny characters.
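
A minimal sketch of the effect described above. It is written for Python 3 (the session in this thread is Python 2), and it assumes the server actually sent ISO-8859-1 (Latin-1) bytes, which the thread does not confirm. The sample string is made up to resemble the weather data.

```python
# Python 3 sketch. Assumption: the bytes on the wire were ISO-8859-1.
# "\u00f6" is the letter o-umlaut; Latin-1 encodes it as the single byte 0xF6.
raw = "Meistens bew\u00f6lkt".encode("iso-8859-1")

# Strict UTF-8 decoding fails, because 0xF6 is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A UTF-8 console behaves roughly like a lenient decoder: the invalid
# byte is shown as U+FFFD, the "funny character".
shown = raw.decode("utf-8", errors="replace")

# Decoding with the encoding that was really used recovers the text.
fixed = raw.decode("iso-8859-1")

print(valid_utf8)
print(repr(shown))
print(repr(fixed))
```

Decoding explicitly, rather than printing the raw bytestring, is what turns the bytes into text that a UTF-8 console can display.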

This is weird. I looked at the site in Firefox - and it was displayed correctly, including umlauts. Bringing up the info dialog claims the page is UTF-8, and the XML itself says so as well (implicitly, through the missing declaration of an encoding) - but it clearly is *not* UTF-8.
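
When the declared encoding cannot be trusted, as here, one defensive approach is to try the declared encoding strictly and fall back to others. The sketch below is Python 3; `decode_body` is a hypothetical helper, not part of any library, and the fallback order (UTF-8, then ISO-8859-1) is an assumption that suits Western European text but is not a general rule.

```python
def decode_body(raw, declared=None, fallbacks=("utf-8", "iso-8859-1")):
    """Hypothetical helper: return (text, encoding_used) for a byte string.

    Tries the declared encoding first, then each fallback, all strictly.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Only reached if every candidate failed (ISO-8859-1 itself never
    # fails, since it maps every byte value to a character).
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# Made-up sample resembling the response: claimed UTF-8, really Latin-1.
raw = b'<condition data="Meistens bew\xf6lkt"/>'
text, used = decode_body(raw, declared="utf-8")
print(used)
print(text)
```

In Python 3 the declared charset of a response from `urllib.request.urlopen` can be read with `response.headers.get_content_charset()`. Before handing the result to a byte-oriented XML parser, the decoded text can be re-encoded with `text.encode("utf-8")`, which matches the default an XML document has when it carries no encoding declaration.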


One would expect Google to be better at this...

Diez
