Hi all,

I would like to learn how to scrape a password-protected web site. For practice I am using my own Delicious web site. I will obey all applicable rules and legislation.
The Delicious export API was shut down, and I assume the web site itself will be shut down in the foreseeable future. In my Coursera course I learned that it is possible to scrape web sites and extract the information in them. I would like to use this to download my bookmark pages and extract the bookmarks with their accompanying tags, as an alternative to the non-existent export API. I started with:

-- cut --
url_base <- "https://del.icio.us/gmaubach?&page="
date_created <- as.character(Sys.Date())
filename_base <- paste0(date_created, "_Delicious_Page_")
page_start <- 1
page_end <- 670

# seq(page_start, page_end) rather than seq_along(page_start:page_end),
# which would always start at 1 regardless of page_start
for (page in seq(page_start, page_end)) {
  download.file(
    url = paste0(url_base, page),
    destfile = paste0(filename_base, page))
}
-- cut --

This way approx. 1000 bookmarks are not downloaded, because only the public bookmarks are shown. I know that it is possible to authenticate with httr using something like:

-- cut --
library(httr)
page <- GET("https://del.icio.us", authenticate("user", "password"))
-- cut --

To avoid having to authenticate over and over again, it is possible to use handles:

-- cut --
delicious <- handle("https://del.icio.us")
-- cut --

I do not know how to put it all together. What would be a statement sequence for getting all stored bookmarks on pages 1..670 using authentication?

Kind regards
Georg
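P.S. My own (untested) guess at putting the pieces together is below. I am assuming that authenticate() is the right login mechanism here; it sends HTTP basic authentication, and if Delicious expects a login form instead, a POST to the login page would presumably be needed. The user name, password, and the "gmaubach" path are placeholders for my own account details.

-- cut --
library(httr)

# One handle, so the connection and cookies are reused across all requests
delicious <- handle("https://del.icio.us")

date_created <- as.character(Sys.Date())
filename_base <- paste0(date_created, "_Delicious_Page_")

for (page in seq(1, 670)) {
  resp <- GET(
    handle = delicious,
    path = "gmaubach",
    query = list(page = page),
    authenticate("user", "password"))
  stop_for_status(resp)  # fail early on a bad login or a missing page
  writeLines(
    content(resp, as = "text", encoding = "UTF-8"),
    paste0(filename_base, page, ".html"))
}
-- cut --

Is this roughly correct?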