Hi,

I am developing a web crawler in R and need help removing JavaScript and
style sheets from the HTML document of a web page.

I tried the XML package and its xpathApply function:

  library(XML)
  # 'html' is assumed to be the page already parsed into a document
  txt = xpathApply(html,
                   "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                   xmlValue)

The output comes out as lines of plain text, without any HTML tags. I want
the HTML tags to remain intact and only the JavaScript and styles removed.
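
To make the goal concrete, here is a minimal sketch of the kind of result being asked for, not a tested solution for real crawled pages. It assumes the XML package's htmlParse, getNodeSet, removeNodes and saveXML functions; the sample page and the variable names (page, doc, unwanted, cleaned) are made up for illustration. The idea is to delete the script and style elements from the parsed tree and serialize what is left, rather than extracting text nodes.

```r
library(XML)

# Made-up sample page; a crawler would pass the fetched page source instead.
page <- paste0(
  "<html><head><style>p { color: red; }</style></head>",
  "<body><p>Hello <b>world</b></p>",
  "<script>alert('hi');</script></body></html>"
)

# Parse the string as HTML (asText = TRUE: 'page' is content, not a file name).
doc <- htmlParse(page, asText = TRUE)

# Select every <script> and <style> element in the tree and detach them.
unwanted <- getNodeSet(doc, "//script | //style")
invisible(removeNodes(unwanted))

# Serialize the remaining tree back to an HTML string, tags intact.
cleaned <- saveXML(doc)
cat(cleaned)
```

Inline handlers such as onclick attributes and style attributes are not touched by this sketch; only the script and style elements are dropped.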

Any help would be highly appreciated.
Thanks in advance.


--
View this message in context: 
http://r.789695.n4.nabble.com/Scrap-java-scripts-and-styles-from-an-html-document-tp3413894p3413894.html
Sent from the R help mailing list archive at Nabble.com.

