Hello all - it's been a while! I'm trying to parse a webpage using lxml; every time I try, I'm rewarded with "UnicodeDecodeError: 'ascii' codec can't decode byte 0x?? in position ?????: ordinal not in range(128)" (the byte value and the position occasionally change; the error never does).
The page's encoding is UTF-8:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

so I have tried:
- setting HTMLParser's encoding to 'utf-8'
- reading the page first, decoding as 'utf-8', then re-encoding as 'ascii' with options 'replace' or 'ignore'
- and various combinations thereof

Here's my current version, trying everything at once:

    from __future__ import print_function
    import datetime
    import urllib2
    from lxml import etree
    url = 'http://www.wpc-edi.com/reference/codelists/healthcare/claim-adjustment-reason-codes/'
    page = urllib2.urlopen(url)
    pagecontents = page.read()
    pagecontents = pagecontents.decode('utf-8')
    pagecontents = pagecontents.encode('ascii', 'ignore')
    tree = etree.parse(pagecontents, etree.HTMLParser(encoding='utf-8',recover=True))

and here's the result:

    Traceback (most recent call last):
      File "etreeTest.py", line 10, in <module>
        tree = etree.parse(pagecontents, etree.HTMLParser(encoding='utf-8',recover=True))
      File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
      File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
      File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
      File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
      File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
      File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
      File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
      File "parser.pxi", line 579, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71894)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 63953: ordinal not in range(128)
    Script terminated.

I'm at my wit's end: how do I either change HTMLParser's codec to UTF-8, or strip non-ASCII characters out of the stream? What am I missing?
Environment:
Python 2.7.3, 32bit - on Windows 7 Ultimate, 64bit
lxml 2.3
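
For reference, here is a minimal sketch of the two things asked about above. It rests on one fact and one assumption. The fact: etree.parse() expects a file name, URL or file-like object as its first argument, whereas etree.fromstring() accepts the document itself. The assumption: handing the page contents to the parser as a byte string is what is wanted here. The helper names strip_non_ascii and parse_html_bytes are hypothetical, and this is not presented as the definitive fix.

```python
def strip_non_ascii(raw_bytes, source_encoding="utf-8"):
    """Return raw_bytes with every non-ASCII character dropped.

    Bytes that are not valid in source_encoding are first replaced by
    U+FFFD, which the ASCII re-encoding step then discards as well.
    """
    text = raw_bytes.decode(source_encoding, "replace")
    return text.encode("ascii", "ignore")


def parse_html_bytes(raw_bytes, encoding="utf-8"):
    """Parse an HTML document held in memory and return its root element.

    lxml is a third-party package; it is imported here so that
    strip_non_ascii() stays usable without it.
    """
    from lxml import etree

    parser = etree.HTMLParser(encoding=encoding, recover=True)
    # fromstring() treats its argument as the document itself, not as a
    # file name or URL to open.
    return etree.fromstring(raw_bytes, parser)
```

With the script above, parse_html_bytes(strip_non_ascii(page.read())) would return the root html element of the downloaded page. To keep calling etree.parse(), the bytes could instead be wrapped in a file-like object such as io.BytesIO.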