I am web scraping a page at

http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=

From this URL, I have built a data frame with the following code:

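library(purrr)
library(rvest)

dflist <- map(.x = 1:417, .f = function(x) {
  Sys.sleep(5)  # pause between requests so as not to overload the server
  # insert the page number x into the URL (it was hard-coded as page=1)
  url <- paste0("http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=", x, "&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=")
  read_html(url) %>%
    html_nodes(".title a") %>%   # links carrying the project titles
    html_text() %>%
    as.data.frame()
}) %>% do.call(rbind, .)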

I have repeated the same code to get all the other fields I was interested in,
and it seems to work perfectly, although it is of course a little slow because
of the Sys.sleep() call.

My problem arose when I tried to scrape the individual project
descriptions, which should also be included in the data frame.

For instance, the first project description is at

http://catalog.ihsn.org/index.php/catalog/7118/study-description

the second project description is at

http://catalog.ihsn.org/index.php/catalog/6606/study-description

and so forth.

My problem is that I can't find a dynamic way to scrape all the project pages
and insert them in the data frame, because the number in each URL is neither
sequential nor at the end of the link.

To make things clearer, this is the structure of the website I am scraping:

1. http://catalog.ihsn.org/index.php/catalog#_r=&collection=&country=&dtype=&from=1890&page=1&ps=100&sid=&sk=&sort_by=nation&sort_order=&to=2017&topic=&view=s&vk=
   1.1. http://catalog.ihsn.org/index.php/catalog/7118
        1.1.a. http://catalog.ihsn.org/index.php/catalog/7118/related_materials
        1.1.b. http://catalog.ihsn.org/index.php/catalog/7118/study-description
        1.1.c. http://catalog.ihsn.org/index.php/catalog/7118/data_dictionary

I have successfully scraped level 1 but cannot scrape level 1.1.b
(study-description), the one I am interested in, because the dynamic element of
the URL (in this case, 7118) does not follow a predictable pattern across the
more than 6000 pages at that level.
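
One direction that might work, sketched under assumptions rather than tested against the live site: if each ".title a" link on the listing page carries an href of the form .../catalog/<id>, the ids could be read from those links instead of being guessed. The helper name study_description_urls and the inline HTML fragment below are made up for illustration; only read_html(), html_nodes() and html_attr() are existing rvest functions.

```r
library(rvest)

# Hypothetical helper: given a parsed listing page, return the
# study-description URL of every project linked from it.
# Assumes (not verified) that each ".title a" href contains "/catalog/<id>".
study_description_urls <- function(page) {
  links <- html_attr(html_nodes(page, ".title a"), "href")
  # keep only links that actually contain a numeric catalog id
  links <- links[grepl("/catalog/[0-9]+", links)]
  # extract the numeric id that follows "/catalog/"
  ids <- sub(".*/catalog/([0-9]+).*", "\\1", links)
  paste0("http://catalog.ihsn.org/index.php/catalog/", ids, "/study-description")
}

# Made-up fragment standing in for one real results page
listing <- read_html(paste0(
  '<div class="title"><a href="http://catalog.ihsn.org/index.php/catalog/7118">A</a></div>',
  '<div class="title"><a href="http://catalog.ihsn.org/index.php/catalog/6606">B</a></div>'
))

urls <- study_description_urls(listing)
print(urls)
```

Each resulting URL could then be passed to read_html() in the same kind of map() loop as above, with the Sys.sleep() pause kept in place. The selector needed for the description text itself is not known here and would have to be found by inspecting one of those pages.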

