Peter Pei wrote:
> I am trying to read a web page and save it in a .html file. The problem is 
> that the web page is GB-2312 encoded, and I want to save it to the file with 
> the same encoding or unicode. I have some code like this:
>     url = 'http://blah/'
>     headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }
> 
>     req = urllib2.Request(url, None, headers)
>     page = urllib2.urlopen(req).read()
> 
>     file = open('btchina.html','wb')
>     file.write(page.encode('gb-2312'))
>     file.close()
> 
> It is obviously not working, and I am hoping someone can help me.

.read() returns the bytes exactly as they were downloaded. It doesn't
interpret them. If those bytes are GB-2312-encoded text, then that is
what you already have, and there's no need to re-encode them. Just call
.write(page). Of course, this way nothing verifies that the bytes
really are valid GB-2312.
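
To make that concrete, here is a minimal sketch, not a definitive
implementation. The helper names save_raw and save_as_utf8 and the file
names are made up for illustration, and the 'page' value is a hard-coded
stand-in for what urllib2.urlopen(req).read() would return, so the
example runs without a network connection. It assumes the page really
is GB-2312, which Python spells 'gb2312'.

```python
import os
import tempfile


def save_raw(data, path):
    """Write the downloaded bytes to disk unchanged (no decode/encode step)."""
    with open(path, 'wb') as out:
        out.write(data)


def save_as_utf8(data, path, source_encoding='gb2312'):
    """Decode from the assumed source encoding, then write the text as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    which doubles as the verification that save_raw() skips.
    """
    text = data.decode(source_encoding)
    with open(path, 'wb') as out:
        out.write(text.encode('utf-8'))


# Stand-in for urllib2.urlopen(req).read(): two Chinese characters,
# already encoded as GB-2312 bytes, just as a server would send them.
page = u'\u4e2d\u6587'.encode('gb2312')

# Hypothetical output locations in a scratch directory.
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, 'raw.html')
utf8_path = os.path.join(workdir, 'utf8.html')

save_raw(page, raw_path)       # same encoding as the server sent
save_as_utf8(page, utf8_path)  # converted to a Unicode encoding
```

The first helper is all you need to keep the original encoding. The
second is only for the "or unicode" case: decode first, then encode to
whatever you want on disk.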

(BTW, don't use 'file' as a variable name. It shadows the built-in
'file' type, which is what 'open()' returns.)