Here is my code. There are three methods of getting the text to be parsed by the htmlParse function.
1. A file on my computer

    options(encoding="gbk")
    library(XML)
    # /home/tiger/Desktop/27174.htm is a copy of
    # http://www.jb51.net/article/27174.htm downloaded to my computer
    xmltext1 <- htmlParse("/home/tiger/Desktop/27174.htm")

2. The URL

    options(encoding="gbk")
    library(XML)
    xmltext2 <- htmlParse("http://www.jb51.net/article/27174.htm")

3. readLines

    options(encoding="gbk")
    library(XML)
    txt <- readLines("http://www.jb51.net/article/27174.htm")
    xmltext3 <- htmlParse(txt, asText=TRUE)

Methods 1 and 2 are OK: both return the right content to be parsed. When I run method 3, to my surprise, xmltext3 contains some of the content, but much of it is missing, so the result is not the same as with methods 1 and 2. Why? Only a small part of the HTML comes back:

    > xmltext3
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-cn"><head>
    <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
    <title>PYTHONæ£å</title>
    </head></html>
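One way to narrow this down, offered as a sketch and not a confirmed diagnosis: with options(encoding="gbk"), readLines may re-encode the text to the session's native encoding, while the page's meta tag still declares gb2312, so the parser could be misreading the bytes and stopping early. The helper name parse_lines is hypothetical, and the "UTF-8" default is an assumption about the native encoding of the session. The helper joins the lines into one string and passes the encoding to htmlParse explicitly.

```r
library(XML)

# Hypothetical helper (not part of the XML package): collapse the character
# vector returned by readLines() into a single string and tell htmlParse()
# which encoding the text is in. "UTF-8" is an assumption about the session.
parse_lines <- function(lines, encoding = "UTF-8") {
  html <- paste(lines, collapse = "\n")
  htmlParse(html, asText = TRUE, encoding = encoding)
}

# Small in-memory check that needs no network access.
sample_lines <- c("<html><head><title>t</title></head>",
                  "<body><p>one</p><p>two</p></body></html>")
doc <- parse_lines(sample_lines)
n_paragraphs <- length(getNodeSet(doc, "//p"))
page_title <- xpathSApply(doc, "//title", xmlValue)

# With the page from the question (needs network access):
# options(encoding = "gbk")
# txt <- readLines("http://www.jb51.net/article/27174.htm", warn = FALSE)
# xmltext3 <- parse_lines(txt)
```

Comparing length(getNodeSet(xmltext3, "//*")) with the same count for xmltext2 would show whether the explicit encoding brings the missing nodes back.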