Tim Arnold wrote:
> "?????? ???????????" <gdam...@gmail.com> wrote in message
> news:ciqh56-ses....@archaeopteryx.softver.org.mk...
>> So, I'm using lxml to screen-scrape a site that uses the Cyrillic
>> alphabet (windows-1251 encoding). The site's HTML doesn't have the
>> <META ... content-type ... charset=...> header, but it does have an
>> HTTP header that specifies the charset... so they are
>> standards-compliant enough.
>>
>> Now when I run this code:
>>
>> from lxml import html
>> doc = html.parse('http://a1.com.mk/')
>> root = doc.getroot()
>> title = root.cssselect('head title')[0]
>> print title.text
>>
>> title.text is a unicode string, but it has been wrongly decoded as
>> latin1 -> unicode.
>
> The way I do that is to open the file with codecs, encoding=cp1251,
> read it into a variable and feed that to the parser.
Yes, if you know the encoding from an external source (especially when
parsing broken HTML), it's best to pass in either a decoded string or a
decoding file-like object, as in

    tree = lxml.html.parse(codecs.open(..., encoding='...'))

You can also create a parser with an encoding override:

    parser = etree.HTMLParser(encoding='...', **other_options)

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
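For completeness, here is a minimal sketch of both approaches. The
windows-1251 sample markup and its Cyrillic title are made up for
illustration; in practice you would feed in the real page bytes (or a
file/URL) instead:

```python
from lxml import html, etree

# Hypothetical windows-1251 page with no charset hint in the markup itself.
raw = u'<html><head><title>Наслов</title></head><body></body></html>'.encode('cp1251')

# Option 1: decode the bytes yourself and hand lxml a unicode string
# (with a file, codecs.open(path, encoding='cp1251') does the same job).
doc = html.fromstring(raw.decode('cp1251'))
print(doc.findtext('.//title'))

# Option 2: keep the bytes and let lxml decode them, via an HTMLParser
# created with an explicit encoding override.
parser = etree.HTMLParser(encoding='cp1251')
doc2 = html.fromstring(raw, parser=parser)
print(doc2.findtext('.//title'))
```

Either way the title comes back correctly decoded, instead of being
mis-decoded as latin1.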