Steven Truppe <steven.tru...@chello.at> writes:
> type= <type 'str'> title = Wizo - Anderster Full Album - YouTube
> type= <type 'str'> title = Wizo - Bleib Tapfer / für'n Arsch Full Album - YouTube
> Traceback (most recent call last):
>   File "./music-fetcher.py", line 39, in <module>
>     title = HTMLParser.HTMLParser().unescape(title)
>   File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
>     return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
>   File "/usr/lib/python2.7/re.py", line 155, in sub
>     return _compile(pattern, flags).sub(repl, string, count)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 23: ordinal not in range(128)

This looks like either a bug in "HTMLParser" or a usage problem with its "unescape" method. Your title is a "str" (a byte string) containing the non-ASCII byte 0xc3, which Python 2 cannot implicitly decode as ASCII.

I would use "lxml" to parse your HTML instead. It automatically converts character references (like the "&#39;" above) and handles special characters (like "ü") correctly. Under Python 2, "lxml" returns text data either as "str" (if the result is pure ASCII) or as "unicode" (otherwise).
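
If you want to stay with the standard library, decoding the title before unescaping it may already be enough. The following is only a sketch under assumptions: I am guessing that the page is UTF-8 encoded (the 0xc3 byte suggests it), the sample title with "&#39;" is reconstructed from your output, and "clean_title" is a name I made up for illustration.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2
    unescape = HTMLParser().unescape


def clean_title(raw_title, encoding="utf-8"):
    """Hypothetical helper: decode bytes first, then resolve character references."""
    # Assumption: the page is UTF-8 encoded; pass another encoding if it is not.
    if isinstance(raw_title, bytes):
        raw_title = raw_title.decode(encoding)
    return unescape(raw_title)


# Sample input reconstructed from the output above; 0xc3 sits at position 23.
raw = b"Wizo - Bleib Tapfer / f\xc3\xbcr&#39;n Arsch Full Album - YouTube"
title = clean_title(raw)
```

With the title decoded up front, "unescape" only ever sees text, so no implicit ASCII decoding of the byte string should be needed.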