Re: [Tutor] converting encoded symbols from rss feed?
> OK, so newline is unicode, outfile.write() wants a plain string. What
> encoding do you want outfile to be in? Try something like
>     outfile.write(newline.encode('utf-8'))
> or use the codecs module to create an output that knows how to encode.

Aha!! The second of the two options above did the trick!

It appears I needed to open my "outfile" with utf-8 encoding. After that, I was able to write out cleaned lines without any hitches. Below is the working code.

And of course, many thanks for the help!!

import codecs

# strip_html() and translate_code() are the helpers discussed earlier in the thread
infile = open('test.txt','rb')
#infile = codecs.open('test.txt','rb','utf-8')
# codecs.open returns a writer that encodes unicode to utf-8 on write
outfile = codecs.open('test_cleaned.txt','wb','utf-8')

for line in infile:
    cleanline = strip_html(translate_code(line)).strip()
    if cleanline:
        outline = cleanline + '\n'
        outfile.write(outline)
    else:
        continue

infile.close()
outfile.close()

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
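For anyone reading this on Python 3, the fix described above (let the file object do the encoding) can be sketched roughly as follows. This is only an illustration, not the code from the thread: write_clean_lines and the sample strings are hypothetical, and io.open with an encoding argument stands in for codecs.open.

```python
import io
import os
import tempfile


def write_clean_lines(lines, path):
    """Write the non-empty, stripped lines to `path` as UTF-8 text."""
    # The file object encodes str to UTF-8 bytes on every write, so the
    # caller never has to call .encode() by hand.
    with io.open(path, "w", encoding="utf-8", newline="\n") as outfile:
        for line in lines:
            cleanline = line.strip()
            if cleanline:
                outfile.write(cleanline + "\n")


# Hypothetical sample data: one line with non-ASCII text, one blank line.
demo_path = os.path.join(tempfile.mkdtemp(), "test_cleaned.txt")
write_clean_lines(["Espa\u00f1a \u2013 2009\n", "   \n", "plain ascii\n"], demo_path)

# Read the raw bytes back to see what actually landed on disk.
with io.open(demo_path, "rb") as f:
    raw = f.read()
```

Reading the file back in binary mode shows the UTF-8 bytes on disk, and the blank line is skipped just as in the loop above.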
Re: [Tutor] converting encoded symbols from rss feed?
On Thu, Jun 18, 2009 at 9:03 PM, Serdar Tumgoren wrote:

> When I run this code:
>
> <<< snip >>>
> for line in infile:
>     cleanline = translate_code(line)
>     newline = strip_html(cleanline)
>     outfile.write(newline)
> <<< snip >>>
>
> ...I receive the below traceback:
>
> Traceback (most recent call last):
>   File "htmlcleanup.py", line 112, in <module>
>     outfile.write(newline)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
> position 21: ordinal not in range(128)

OK, so newline is unicode, outfile.write() wants a plain string. What encoding do you want outfile to be in? Try something like

    outfile.write(newline.encode('utf-8'))

or use the codecs module to create an output that knows how to encode.

Kent
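The traceback above comes from Python 2 quietly trying to encode the unicode string as ASCII before writing it. A small sketch of the same failure and of the explicit fix, written for Python 3 (where the encode step is never implicit); the sample string is made up for illustration:

```python
newline = "ma\u00f1ana"  # contains U+00F1, the character named in the traceback

# Encoding to ASCII fails because U+00F1 is outside the range 0-127.
try:
    newline.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields bytes that a binary file accepts.
utf8_bytes = newline.encode("utf-8")
```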
Re: [Tutor] converting encoded symbols from rss feed?
Ok, I should say that I managed to "solve" the problem by first reading and translating the data, and then applying Mr. Lundh's strip_html function to the resulting lines.

For future reference (and of course any additional feedback), the working code is here:

http://pastebin.com/f309bf607

But of course that's a Band-Aid approach and I'm still interested in understanding the root of the problem. To that end, I've attached the Exception below from the problematic code.

> Your try/except is hiding the problem. What happens if you take it
> out? what error do you get?
>
> My guess is that strip_html() is returning unicode and
> translate_code() is expecting strings but I'm not sure without seeing
> the error.

When I run this code:

<<< snip >>>
for line in infile:
    cleanline = translate_code(line)
    newline = strip_html(cleanline)
    outfile.write(newline)
<<< snip >>>

...I receive the below traceback:

Traceback (most recent call last):
  File "htmlcleanup.py", line 112, in <module>
    outfile.write(newline)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 21: ordinal not in range(128)
Re: [Tutor] converting encoded symbols from rss feed?
2009/6/18 Serdar Tumgoren:

>> In [7]: print x.encode('cp437')
>> --> print(x.encode('cp437'))
>> abc░
>
> So does this mean that my python install is incapable of encoding the
> en/em dash?

No, the problem is with the print, not the encoding. Your console, as configured, is incapable of displaying the en dash.

> But for some reason, I can't seem to get my translate_code function to
> work inside the same loop as Mr. Lundh's html cleanup code. Below is
> the problem code:
>
> infile = open('test.txt','rb')
> outfile = open('test_cleaned.txt','wb')
>
> for line in infile:
>     try:
>         newline = strip_html(line)
>         cleanline = translate_code(newline)
>         outfile.write(cleanline)
>     except:
>         newline = "NOT CLEANED: %s" % line
>         outfile.write(newline)
>
> infile.close()
> outfile.close()
>
> The strip_html function, documented here
> (http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
> string as far as I can tell. I'm confused why I wouldn't be able to
> further manipulate the string with the "translate_code" function and
> store the result in the "cleanline" variable. When I try this
> approach, none of the translations succeed and I'm left with the same
> HTML gook in the "outfile".

Your try/except is hiding the problem. What happens if you take it out? What error do you get?

My guess is that strip_html() is returning unicode and translate_code() is expecting strings but I'm not sure without seeing the error.

Kent
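As a rough sketch of the point about the bare except: catching only encoding errors and recording the exception keeps the failure visible instead of silently writing "NOT CLEANED" lines. The names clean_lines and to_ascii are hypothetical, and the code assumes Python 3.

```python
def clean_lines(lines, clean):
    """Apply `clean` to each line, collecting failures instead of hiding them."""
    cleaned, failures = [], []
    for number, line in enumerate(lines, 1):
        try:
            cleaned.append(clean(line))
        except UnicodeError as exc:
            # Narrow except: only encode/decode problems are caught, and the
            # exception itself is kept so the real cause can be inspected.
            failures.append((number, repr(exc)))
    return cleaned, failures


def to_ascii(text):
    """Hypothetical cleaner that fails on any non-ASCII character."""
    return text.encode("ascii").decode("ascii")


cleaned, failures = clean_lines(["plain", "ma\u00f1ana"], to_ascii)
```

Any other exception type (a NameError from a typo, say) still propagates, which is usually what you want while debugging.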
Re: [Tutor] converting encoded symbols from rss feed?
> The example is written assuming the console encoding is utf-8. Yours
> seems to be cp437. Try this:
> In [1]: import sys
>
> In [2]: sys.stdout.encoding
> Out[2]: 'cp437'

That is indeed the result that I get as well.

> But there is another problem - \u2013 is an en dash which does not
> appear in cp437, so even giving the correct encoding doesn't work. Try
> this:
> In [6]: x = u"abc\u2591"
>
> In [7]: print x.encode('cp437')
> --> print(x.encode('cp437'))
> abc░

So does this mean that my python install is incapable of encoding the en/em dash?

For the time being, I've gone with treating the symptom rather than the root problem and created a translate function.

def translate_code(text):
    text = text.replace("&#145;","'")
    text = text.replace("&#146;","'")
    text = text.replace("&#147;",'"')
    text = text.replace("&#148;",'"')
    text = text.replace("&#150;","-")
    text = text.replace("&#151;","--")
    return text

Which of course has led to a new problem. I'm first using Fredrik Lundh's code to extract random html gobbledygook, then running my translate function over the file to replace the windows-1252 encoded characters.

But for some reason, I can't seem to get my translate_code function to work inside the same loop as Mr. Lundh's html cleanup code. Below is the problem code:

infile = open('test.txt','rb')
outfile = open('test_cleaned.txt','wb')

for line in infile:
    try:
        newline = strip_html(line)
        cleanline = translate_code(newline)
        outfile.write(cleanline)
    except:
        newline = "NOT CLEANED: %s" % line
        outfile.write(newline)

infile.close()
outfile.close()

The strip_html function, documented here (http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text string as far as I can tell. I'm confused why I wouldn't be able to further manipulate the string with the "translate_code" function and store the result in the "cleanline" variable. When I try this approach, none of the translations succeed and I'm left with the same HTML gook in the "outfile".

Is there some way to combine these functions so I can perform all the processing in one pass?
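One way to do everything in a single pass is to decode each raw line once, do all the cleanup on Unicode text, and encode only when writing. The sketch below assumes Python 3 and windows-1252 input. TAG_RE is a deliberately crude stand-in for strip_html, and DASHES_AND_QUOTES and clean_line are hypothetical names covering the same six characters that translate_code targets.

```python
import re

# Map the curly quotes and dashes to plain ASCII equivalents.
DASHES_AND_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",    # single curly quotes
    "\u201c": '"', "\u201d": '"',    # double curly quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
})

# Crude tag remover, used here only as a placeholder for strip_html.
TAG_RE = re.compile(r"<[^>]+>")


def clean_line(raw, encoding="cp1252"):
    """Decode one raw line, translate punctuation, drop tags, trim whitespace."""
    text = raw.decode(encoding)              # bytes -> str, exactly once
    text = text.translate(DASHES_AND_QUOTES)
    text = TAG_RE.sub("", text)
    return text.strip()


sample = b"<p>June 17 \x96 19 \x93ok\x94</p>\r\n"
result = clean_line(sample)
```

Because every step receives and returns the same type (str), the order of the two cleanup steps no longer matters.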
Re: [Tutor] converting encoded symbols from rss feed?
On Thu, Jun 18, 2009 at 4:37 PM, Serdar Tumgoren wrote:

> On the above link, the section on "Encoding Unicode Byte Streams" has
> the following example:
>
> >>> u = u"abc\u2013"
> >>> print u
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 3: ordinal not in range(128)
> >>> print u.encode("utf-8")
> abc–
>
> But when I try the same example on my Windows XP machine (with Python
> 2.5.4), I can't get the same results. Instead, it spits out the below
> (hopefully it renders properly and we don't have encoding issues!!!):
>
> $ python
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> x = u"abc\u2013"
> >>> print x
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Program Files\Python25\lib\encodings\cp437.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in
> position 3: character maps to <undefined>
> >>> x.encode("utf-8")
> 'abc\xe2\x80\x93'
> >>> print x.encode("utf-8")
> abcΓÇô

The example is written assuming the console encoding is utf-8. Yours seems to be cp437. Try this:

C:\Project\Mango> py
Python 2.6.1 (r261:67517, Dec 4 2008, 16:51:00) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

In [1]: import sys

In [2]: sys.stdout.encoding
Out[2]: 'cp437'

But there is another problem - \u2013 is an en dash which does not appear in cp437, so even giving the correct encoding doesn't work. Try this:

In [6]: x = u"abc\u2591"

In [7]: print x.encode('cp437')
--> print(x.encode('cp437'))
abc░

> In a related test, I was unable to change the default character encoding
> for the python interpreter from ascii to utf-8. In all cases (cygwin,
> Wing IDE, windows command line), the interpreter reported that my
> "sys" module does not contain the "setdefaultencoding" method (even
> though this should be part of the module from versions 2.x and above).

sys.setdefaultencoding is deleted by site.py on python startup. You have to set the default encoding from within a sitecustomize.py module. But it's usually better to get a correct understanding of what is going on and to leave the default encoding alone.

Kent
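The console behaviour described above can be reproduced without a cp437 console by doing the conversions explicitly. This is a sketch for Python 3; the variable names are arbitrary.

```python
text = "abc\u2013"   # U+2013 is EN DASH

# UTF-8 bytes for the string; a console set to cp437 interprets each byte
# as one cp437 character, which produces the "abcΓÇô" output seen above.
utf8_bytes = text.encode("utf-8")
mojibake = utf8_bytes.decode("cp437")

# cp437 has no en dash, so a strict encode would raise UnicodeEncodeError;
# errors="replace" substitutes "?" instead.
fallback = text.encode("cp437", errors="replace")

# U+2591 (light shade) does exist in cp437, so this encode succeeds.
shade = "abc\u2591".encode("cp437")
```

In other words, the encode to UTF-8 was correct all along; the garbage only appears because the console decodes those bytes with a different code page.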
Re: [Tutor] converting encoded symbols from rss feed?
Hey everyone,

I'm trying to get down to basics with this handy intro on Python encodings:

http://eric.themoritzfamily.com/2008/11/21/python-encodings-and-unicode/

But I'm running into some VERY strange results.

On the above link, the section on "Encoding Unicode Byte Streams" has the following example:

>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–

But when I try the same example on my Windows XP machine (with Python 2.5.4), I can't get the same results. Instead, it spits out the below (hopefully it renders properly and we don't have encoding issues!!!):

$ python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u"abc\u2013"
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 3: character maps to <undefined>
>>> x.encode("utf-8")
'abc\xe2\x80\x93'
>>> print x.encode("utf-8")
abcΓÇô

I get the above results in python interpreters invoked from both the Windows command line and in a cygwin shell. HOWEVER -- the test code works properly (i.e. I get the expected "abc-") when I run the code in WingIDE 10.1 (version 3.1.8-1).

In a related test, I was unable to change the default character encoding for the python interpreter from ascii to utf-8. In all cases (cygwin, Wing IDE, windows command line), the interpreter reported that my "sys" module does not contain the "setdefaultencoding" method (even though this should be part of the module from versions 2.x and above).

Can anyone help me untangle this mess?

I'd be indebted!
Re: [Tutor] converting encoded symbols from rss feed?
> Some further searching reveals this:
> (yay archives ;))
> http://mail.python.org/pipermail/python-list/2008-April/658644.html

Aha! I noticed that 150 was missing from the ISO encoding table and the source xml is indeed using windows-1252 encoding.

That explains why this appears to be the only character in the xml source that doesn't seem to get translated by Universal Feed Parser. But I'm now wondering if the feed parser is using windows-1252 rather than some other encoding. The below page provides details on how UFP handles character encodings.

http://www.feedparser.org/docs/character-encoding.html

I'm wondering if there's a way to figure out which encoding UFP uses when it parses the file.

I didn't have the Universal Encoding Detector (http://chardet.feedparser.org/) installed when I parsed the xml file. It's not clear to me whether UFP requires that library to detect the encoding or if it's an optional part of its broader routine for determining encoding.
Re: [Tutor] converting encoded symbols from rss feed?
Hey everyone,

For the moment, I opted to use string replacement as my "solution."

So for the below string containing the HTML decimal representation for en dash:

>>> x = "The event takes place June 17 &#150; 19"
>>> x.replace('&#150;', '-')
'The event takes place June 17 - 19'

It works in my case since this seems to be the only code that Universal Feed Parser didn't properly translate, but of course it's not an ideal solution. I assume this path will require me to build a character reference dictionary as I encounter more character codes.

I also tried wrestling with character conversion:

>>> unichr(150)
u'\x96'

Not sure where to go from there...
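The reason unichr(150) is a dead end: 150 (0x96) is a windows-1252 byte value, not a Unicode code point for a dash. As a Unicode code point it is an invisible control character, but decoded as a windows-1252 byte it becomes the en dash. A short Python 3 sketch (chr replaces unichr there):

```python
import unicodedata

code = 150

# Treated as a Unicode code point, 150 is a C1 control character.
as_code_point = chr(code)

# Treated as a single windows-1252 byte, 150 decodes to the en dash.
as_cp1252 = bytes([code]).decode("cp1252")

# Look up the official Unicode name of the decoded character.
dash_name = unicodedata.name(as_cp1252)
```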
Re: [Tutor] converting encoded symbols from rss feed?
> Upon searching for &#150; in google, I came up with this:
> http://www.siber-sonic.com/mac/charsetstuff/Soniccharset.html

The character table definitely helps. Thanks.

Some additional googling suggests that I need to unescape HTML entities. I'm planning to try the below approach from Fredrik Lundh. It relies on the "re" and "htmlentitydefs" modules.

http://effbot.org/zone/re-sub.htm#unescape-html

I'll report back with my results. Meantime, I welcome any other suggestions.

Thanks!
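For comparison, on Python 3 the standard library can do the unescaping directly. This sketch uses html.unescape rather than the effbot recipe linked above, so it is a swapped-in technique and not the approach from the thread. It follows the HTML5 rules, under which numeric references in the 128-159 range are read as windows-1252. The sample strings are made up.

```python
import html

line = "The event takes place June 17 &#150; 19 &amp; beyond"

# One call handles both named entities and numeric character references.
unescaped = html.unescape(line)

# Doubly escaped input (as it might appear in raw feed XML) needs two passes.
once = html.unescape("&amp;#150;")
twice = html.unescape(once)
```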
Re: [Tutor] converting encoded symbols from rss feed?
On Wed, Jun 17, 2009 at 7:30 AM, Serdar Tumgoren wrote:

> Here are some examples of the encoded characters I'm trying to
> convert:
>
> &amp;#150; (symbol as it appears in the original xml file)
> &#150; (symbol as it appears in ipython shell after
> using Universal Feed Parser)

I've never played around much, but &amp; is the HTML code for an &. So if you have &amp;#150; it will show up as &#150;. I have no clue if the latter is any type of special character or something, though.

Upon searching for &#150; in google, I came up with this:
http://www.siber-sonic.com/mac/charsetstuff/Soniccharset.html

HTH,
Wayne