Thank you for your responses Martin and Gabor, very much appreciated! In case anyone does a search for this topic, I thought I'd write a few comments below on what I ended up doing:
re: Internet Explorer (IE) - Finding out that R can access IE was a very pleasant surprise! This works very well for extracting the plain text from an HTML-formatted page. The only downsides for me were (1) it is rather slow if you wish to convert lots of HTML files into plain text files, even when the HTML files are already on your computer, and (2) when converting some HTML files, an IE 'pop-up' window may appear and execution cannot continue until that pop-up has been dealt with. There may be ways around this, but I am not aware of them. This is an example of the code I used:

library(RDCOMClient)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
for (u in urls) {
  ie$Navigate(u)                 # load the page in IE
  while (ie[["Busy"]]) Sys.sleep(1)  # wait until the page has finished loading
  txt[[u]] <- ie[["document"]][["body"]][["innerText"]]
}
ie$Quit()
print(txt)

re: xpathApply() - I must admit this was a little confusing when I first encountered it after reading your post, but after some reading I think I have found out how to get what I want. This seems to work almost as well as IE above, and I have found it to be faster for my purposes, probably because there is no need to wait for an external application, plus there is no danger of a 'pop-up' window appearing. As far as I can tell, all plain text is extracted.
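One thing I noticed is that the innerText you get back from IE tends to contain long runs of stray whitespace. A small base-R helper can tidy it up before you save it; clean_text() below is just a name of my own, not part of RDCOMClient:

```r
## Hypothetical helper (my own addition): tidy the whitespace in the
## plain text returned by ie[["document"]][["body"]][["innerText"]].
clean_text <- function(x) {
  x <- gsub("\r\n", "\n", x, fixed = TRUE)  # normalise Windows line endings
  x <- gsub("[ \t]+", " ", x)               # collapse runs of spaces and tabs
  x <- gsub("\n{3,}", "\n\n", x)            # squeeze long runs of blank lines
  gsub("^\\s+|\\s+$", "", x)                # trim leading/trailing whitespace
}

clean_text("  a \t b\r\n\r\n\r\n\r\nc  ")  # -> "a b\n\nc"
```

You could then apply it with txt <- lapply(txt, clean_text) before printing or saving.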
library(RCurl)
library(XML)
urls <- c("https://stat.ethz.ch/mailman/listinfo/r-help",
          "http://wiki.r-project.org/rwiki/doku.php?id=getting-started:what-is-r:what-is-r")
txt <- list()
html.files <- getURL(urls, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
                     followlocation = TRUE)
for (u in urls) {
  # parse the downloaded HTML and extract all text nodes, skipping
  # anything inside <script> or <style> elements
  html <- htmlTreeParse(html.files[[u]], useInternalNodes = TRUE)
  txt[[u]] <- toString(xpathApply(html,
      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
      xmlValue))
}
print(txt)

Cheers,
Tony Breyal

On 6 Oct, 16:45, Tony Breyal <[EMAIL PROTECTED]> wrote:
> Dear R-help,
>
> I want to download the text from a web page, however what I end up
> with is the html code. Is there some option that I am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code I am using:
>
> > library(RCurl)
> > my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help'
> > html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
> >                     followlocation = TRUE)
> > print(html.file)
>
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
>
> > library(XML)
> > htmlTreeParse(html.file)
>
> Many thanks for any help you can provide,
> Tony Breyal
>
> > sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;
> LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;
> LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] XML_1.94-0 RCurl_0.9-4
>
> ______________________________________________
> [EMAIL PROTECTED] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
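P.S. Since my goal was to batch-convert pages into plain text files, here is one way to write each entry of the txt list out to disk using only base R. The make_filename() helper is an illustration of my own (not from any package) for turning a URL into a safe file name:

```r
## Illustrative helper: turn a URL into a file-system-safe name by
## replacing every run of non-alphanumeric characters with "_".
make_filename <- function(u) {
  paste0(gsub("[^A-Za-z0-9]+", "_", u), ".txt")
}

## 'txt' here stands in for the list built in the snippets above,
## with one extracted plain-text string per URL.
txt <- list("https://stat.ethz.ch/mailman/listinfo/r-help" = "some extracted text")

for (u in names(txt)) {
  writeLines(txt[[u]], make_filename(u))  # one .txt file per page
}
```

For example, the URL above becomes "https_stat_ethz_ch_mailman_listinfo_r_help.txt".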