I edited a file named test and saved it in GBK encoding. My system is Debian (locale: en.utf-8) with Python 2.6 (locale: utf-8).
The file contains:

    <html>
    <p>你</p>
    </html>

In the terminal I ran:

    xxd test
    0000000: 3c68 746d 6c3e 0a3c 703e c4e3 3c2f 703e  <html>.<p>..</p>
    0000010: 0a3c 2f68 746d 6c3e 0a                   .</html>.

你 means "you" in English. "\xc4\xe3" is its GBK encoding, "\xe4\xbd\xa0" is its UTF-8 encoding, and u'\u4f60' is its Unicode code point:

    >>> "你"
    '\xe4\xbd\xa0'
    >>> "你".decode("utf-8")
    u'\u4f60'
    >>> "你".decode("utf-8").encode("gbk")
    '\xc4\xe3'

Now I parse the file with lxml.

code1:

    >>> import lxml.html
    >>> root=lxml.html.parse("test")
    >>> d=root.xpath("//p")
    >>> d[0].text_content()
    u'\xc4\xe3'

According to the documentation, lxml returns the parsed text as unicode. Why doesn't d[0].text_content() return u'\u4f60'?

code2:

    >>> import codecs
    >>> import lxml.html
    >>> f = codecs.open('test', 'r', 'gbk')
    >>> root=lxml.html.parse(f)
    >>> d=root.xpath("//p")
    >>> d[0].text_content()
    u'\xe4\xbd\xa0'

Again, why doesn't d[0].text_content() return u'\u4f60'?

I have been confused by this problem for two days.
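The byte values above can be checked with the standard library alone. This is only a sketch and does not call lxml. It assumes (not confirmed here) that when the file declares no charset the bytes end up being decoded as Latin-1, which would account for both results. The names mojibake_code1, mojibake_code2 and decode_gbk are made up for illustration.

```python
# -*- coding: utf-8 -*-
char = u"\u4f60"  # the character from the file, code point U+4F60

gbk_bytes = char.encode("gbk")     # the two bytes stored in the file: c4 e3
utf8_bytes = char.encode("utf-8")  # the three bytes a UTF-8 terminal produces

# Assumption: with no declared charset, each byte is read as one
# Latin-1 character, so every byte maps to the code point of the same value.
mojibake_code1 = gbk_bytes.decode("latin-1")   # same shape as the code1 result
mojibake_code2 = utf8_bytes.decode("latin-1")  # same shape as the code2 result


def decode_gbk(raw):
    """Hypothetical helper: decode raw GBK bytes to unicode before parsing."""
    return raw.decode("gbk")


# Decoding the <p> line of the file explicitly gives the expected character.
decoded_line = decode_gbk(b"<p>\xc4\xe3</p>")
```

If that assumption holds, the parser has to be told the encoding. One option that might work is passing a parser built with lxml.html.HTMLParser(encoding="gbk") to lxml.html.parse; another is adding a meta charset declaration to the file.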
-- http://mail.python.org/mailman/listinfo/python-list