Here is my code. 

There are three methods to get the text to be parsed by the htmlParse function.

1. File on my computer
options(encoding="gbk") 
library(XML) 
xmltext1 <- htmlParse("/home/tiger/Desktop/27174.htm" ) 

# /home/tiger/Desktop/27174.htm is a copy of
# http://www.jb51.net/article/27174.htm saved on my computer.

2. URL
options(encoding="gbk") 
library(XML) 
xmltext2 <- htmlParse("http://www.jb51.net/article/27174.htm")

3. readLines
options(encoding="gbk") 
library(XML) 
txt <- readLines("http://www.jb51.net/article/27174.htm")
xmltext3 <- htmlParse(txt,asText=TRUE) 

Methods 1 and 2 are OK; they return the correct content to be parsed.
When I run method 3, to my surprise, xmltext3 contains some of the content, but
much of it is missing. The result is not the same as with methods 1 and 2. Why?
Only a small part of the HTML comes back:
> xmltext3 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-cn"><head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>PYTHONæ­£å</title>
</head></html>
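
One way to narrow this down might be to collapse the lines into a single string, pass an explicit encoding to htmlParse, and count nodes so that the results of the three methods can be compared. The sketch below is only an illustration under assumptions: `txt` is a small hypothetical stand-in for the lines returned by readLines, "UTF-8" is the encoding of that stand-in text (not necessarily of the real page), and whether encoding is actually the cause of the truncation above is not confirmed.

```r
library(XML)

# Hypothetical stand-in for the page; with the real page this would be
# the character vector returned by readLines() on the URL.
txt <- c(
  "<html><head><title>demo</title></head>",
  "<body><p>first</p>",
  "<p>second</p></body></html>"
)

# Collapse the lines into one string before parsing it as text.
html <- paste(txt, collapse = "\n")

# Tell the parser explicitly how the text is encoded
# (assumption: UTF-8 for this stand-in text).
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Count <p> nodes as a quick check that the body survived parsing.
n_p <- length(getNodeSet(doc, "//p"))
print(n_p)
```

Running the same node count on xmltext1, xmltext2, and xmltext3 would show how much of the document each method keeps.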