Dear Sir/Madam:

Thank you for your attention to my question. I have downloaded the source code 
of some web pages with RCurl, and I am trying to extract a URL from them. In 
these web pages, many nodes contain the same URL, such as the 
following:

<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/\" rel=\"bookmark\">

<a href=\"http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/\" target=\"_blank\">

<a href=\"http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947\" class=\"more-link\">

I want to select only the URL I need (the "href" in the first one). I have 
tried many approaches, and the most accurate one so far is the following:

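library(XML)

# Attempt 1: returns the href of every link on the page
# links <- getHTMLLinks(base.html, xpQuery = "//a/@href")

# Attempt 2: try to keep only the link marked rel="bookmark"
links <- getHTMLLinks(base.html, xpQuery = c("//a/href[@rel='bookmark']"))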
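
For comparison, a minimal sketch of a query in which the rel test sits on the <a> element and the attribute is then selected with @href. In XPath, "a/href" looks for a child element named href, which the sample nodes do not have. The string sample.html is a hypothetical stand-in built from the three nodes quoted above, not the real downloaded pages, and the link texts are made up. It uses htmlParse() and xpathSApply() from the XML package rather than getHTMLLinks(), purely to keep the example self-contained.

```r
library(XML)

# Hypothetical stand-in for the downloaded page source (base.html in the post);
# the link texts "title", "ref" and "more" are invented for illustration.
sample.html <- paste0(
  '<html><body>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/" rel="bookmark">title</a>',
  '<a href="http://blog.shakirm.com/2015/03/a-statistical-view-of-deep-learning-ii-auto-encoders-and-free-energy/" target="_blank">ref</a>',
  '<a href="http://cos.name/2015/05/the-data-wisdom-for-data-science/#more-10947" class="more-link">more</a>',
  '</body></html>'
)

# Parse the string as HTML text rather than as a file name or URL.
doc <- htmlParse(sample.html, asText = TRUE)

# Filter the <a> elements by their rel attribute, then take the href attribute.
# unname() drops the "href" names; unique() collapses repeated URLs.
links <- unique(unname(xpathSApply(doc, "//a[@rel='bookmark']/@href")))
print(links)
```

On this sample only the node carrying rel="bookmark" should match. Whether the same predicate is selective enough on the full pages has not been checked.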

However, I believe there is a more reliable way to do this that I have not 
been able to find. I would appreciate any advice on solving this problem, and 
I would be most grateful if you could reply at your earliest convenience. 
Looking forward to hearing from you. Thank you very much.

                                     Sincerely yours 

                                     Humphrey Zhao

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.