Hey David,

I'm on a Mac as well but have never had to tweak anything to get [R]Selenium to work (though this is one reason I try to avoid solutions that involve RSelenium: they're pretty fragile, IMO).
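When the Java server is the culprit on a Mac, a quick reachability check before opening the driver usually narrows things down faster than the "Undefined error in RCurl call" message does. A minimal sketch, assuming the standalone server is meant to be listening on the default port 4444 (httr is only used here because it is already in the dependency chain):

library(RSelenium)
library(httr)

checkForServer()   # fetch the selenium-server-standalone jar if it isn't there
startServer()      # launch the Java server (needs a working `java` on the PATH)
Sys.sleep(2)       # give the JVM a moment before poking it

# if this GET fails, remoteDriver$open() will fail the same way, so it is a
# cheaper place to debug (port already in use, jar not found, Java problems, ...)
status <- try(GET("http://localhost:4444/wd/hub/status"), silent = TRUE)
if (inherits(status, "try-error") || status_code(status) != 200) {
  stop("Selenium server isn't answering on port 4444")
}

# being explicit about the defaults sometimes surfaces a clearer error, too
remDr <- remoteDriver$new(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
remDr$open()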
The site itself has "Página 1 de 69" at the top, which is where I got the "69" from, and I just re-ran the code in a 100% clean env (on a completely different Mac) and it worked fine. I did neglect to put my session info up before (apologies):

Session info ---------------------------------------------------------------
 setting  value
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.1172)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2016-05-11

Packages -------------------------------------------------------------------
 package    * version  date       source
 assertthat   0.1      2013-12-06 CRAN (R 3.3.0)
 bitops     * 1.0-6    2013-08-17 CRAN (R 3.3.0)
 caTools      1.17.1   2014-09-10 CRAN (R 3.3.0)
 DBI          0.4      2016-05-02 CRAN (R 3.3.0)
 devtools   * 1.11.1   2016-04-21 CRAN (R 3.3.0)
 digest       0.6.9    2016-01-08 CRAN (R 3.3.0)
 dplyr      * 0.4.3    2015-09-01 CRAN (R 3.3.0)
 httr         1.1.0    2016-01-28 CRAN (R 3.3.0)
 magrittr     1.5      2014-11-22 CRAN (R 3.3.0)
 memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)
 pbapply    * 1.2-1    2016-04-19 CRAN (R 3.3.0)
 R6           2.1.2    2016-01-26 CRAN (R 3.3.0)
 Rcpp         0.12.4   2016-03-26 CRAN (R 3.3.0)
 RCurl      * 1.95-4.8 2016-03-01 CRAN (R 3.3.0)
 RJSONIO    * 1.3-0    2014-07-28 CRAN (R 3.3.0)
 RSelenium  * 1.3.5    2014-10-26 CRAN (R 3.3.0)
 rvest      * 0.3.1    2015-11-11 CRAN (R 3.3.0)
 selectr      0.2-3    2014-12-24 CRAN (R 3.3.0)
 stringi      1.0-1    2015-10-22 CRAN (R 3.3.0)
 stringr      1.0.0    2015-04-30 CRAN (R 3.3.0)
 withr        1.0.1    2016-02-04 CRAN (R 3.3.0)
 XML        * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
 xml2       * 0.1.2    2015-09-01 CRAN (R 3.3.0)

(and, wow, does that tiny snippet of code end up using a lot of pkgs)

I had actually started with smaller snippets to test. The code got uglier due to the way the site paginates (it loads 10 entries' worth of data onto a single page but requires a server call for the next 10). I also keep Firefox scarily out of date (back in the 33-series) b/c I only use it with RSelenium (not a big fan of the browser). Let me update to the 46-series and see if I can replicate.

-Bob
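P.S. The "69" doesn't have to be hard-coded, either; the "Página 1 de 69" label can be read off the page once remDr$navigate(URL) has run. A rough sketch (the "span" selector is a guess and needs to be checked against the real page source):

library(rvest)

# assumes remDr has already navigated to URL, as in the code further down
pg <- read_html(remDr$getPageSource()[[1]])

# "span" is an assumption -- inspect the page to find the element that actually
# holds the "Página 1 de 69" text and adjust the selector to match
pager <- html_text(html_nodes(pg, "span"))
pager <- pager[grepl("P.gina\\s+\\d+\\s+de\\s+\\d+", pager)][1]  # "." dodges encoding quirks

n_pages <- as.integer(sub(".*de\\s+(\\d+).*", "\\1", pager))
n_pages   # use this in place of the literal 69 in both spots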
On Wed, May 11, 2016 at 1:48 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
>> On May 10, 2016, at 1:11 PM, boB Rudis <b...@rudis.net> wrote:
>>
>> Unfortunately, it's a wretched, vile, SharePoint-based site. That
>> means it doesn't use traditional encoding methods to do the pagination,
>> and one of the only ways to do this effectively is going to be to use
>> RSelenium:
>>
>> library(RSelenium)
>> library(rvest)
>> library(dplyr)
>> library(pbapply)
>>
>> URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>
>> checkForServer()
>> startServer()
>> remDr <- remoteDriver$new()
>> remDr$open()
>
> Thanks Bob/hrbrmstr;
>
> At this point I got an error:
>
>> startServer()
>> remDr <- remoteDriver$new()
>> remDr$open()
> [1] "Connecting to remote server"
> Undefined error in RCurl call.Error in queryRD(paste0(serverURL, "/session"),
>   "POST", qdata = toJSON(serverOpts)) :
>
> Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
>
> $ java -version
> java version "1.8.0_65"
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
>
> I asked myself: what additional information is needed to debug this? But then
> I thought I had a responsibility to search for earlier reports of this error
> on a Mac, and there were many.
>
> After reading this thread:
> https://github.com/ropensci/RSelenium/issues/54
> I decided to try creating an "alias", mac-speak for a symlink, and put that
> symlink in my working directory (with no further chmod security efforts). I
> restarted R and re-ran the code, which opened a Firefox browser window and
> then proceeded to page through many pages. Eventually, however, it errors out
> with this message:
>
>> pblapply(1:69, function(i) {
> +
> +   if (i %in% seq(1, 69, 10)) {
> +     pg <- read_html(remDr$getPageSource()[[1]])
> +     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +   } else {
> +     ref <- remDr$findElements("xpath",
> +       sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
> +     ref[[1]]$clickElement()
> +     pg <- read_html(remDr$getPageSource()[[1]])
> +     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +   }
> +   if ((i %% 10) == 0) {
> +     ref <- remDr$findElements("xpath", ".//a[.='...']")
> +     ref[[length(ref)]]$clickElement()
> +   }
> +
> +   ret
> +
> + }) -> tabs
>   |+++++++++++                                        | 22% ~54s
> Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
>>
>> final_dat <- bind_rows(tabs)
> Error in bind_rows(tabs) : object 'tabs' not found
>
> There doesn't seem to be any trace of objects from all the downloading
> efforts that I could find. When I changed both instances of '69' to '30' it
> no longer errors out. Is there supposed to be an initial step of finding out
> how many pages are actually there before setting the two iteration limits?
> I'm wondering if that code could be modified to return some intermediate
> values that would be amenable to further assembly efforts in the event of
> errors?
>
> Sincerely;
> David.
>
>
>> remDr$navigate(URL)
>>
>> pblapply(1:69, function(i) {
>>
>>   if (i %in% seq(1, 69, 10)) {
>>
>>     # the first item on the page is not a link but we can just grab the page
>>
>>     pg <- read_html(remDr$getPageSource()[[1]])
>>     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>   } else {
>>
>>     # we can get the rest of them by the link text directly
>>
>>     ref <- remDr$findElements("xpath",
>>       sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
>>     ref[[1]]$clickElement()
>>     pg <- read_html(remDr$getPageSource()[[1]])
>>     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>   }
>>
>>   # we have to move to the next actual page of data after every 10 links
>>
>>   if ((i %% 10) == 0) {
>>     ref <- remDr$findElements("xpath", ".//a[.='...']")
>>     ref[[length(ref)]]$clickElement()
>>   }
>>
>>   ret
>>
>> }) -> tabs
>>
>> final_dat <- bind_rows(tabs)
>> final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>> final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
>>
>> remDr$quit()
>>
>>
>> Probably good ref code to have around, but you can grab the data & code
>> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
>>
>> (anything to help a fellow parent out :-)
>>
>> -Bob
>>
>> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <frien...@yorku.ca> wrote:
>>> This is my first attempt to try R web scraping tools, for a project my
>>> daughter is working on. It concerns a database of projects in Sao Paulo,
>>> Brazil, listed at
>>> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>>> but spread out over 69 pages accessed through a javascript menu at the
>>> bottom of the page.
>>>
>>> Each web page contains 3 HTML tables, of which only the last contains
>>> the relevant data.
>>> In this, only a subset of the columns is of interest.
>>>
>>> I tried using the XML package as illustrated on several tutorial pages,
>>> as shown below. I have no idea how to automate this to extract these
>>> tables from multiple web pages. Is there some other package better
>>> suited to this task? Can someone help me solve this and other issues?
>>>
>>> # Goal: read the data tables contained on 69 pages generated by the link
>>> # below, where each page is generated by a javascript link in the menu at
>>> # the bottom of the page.
>>> #
>>> # Each "page" contains 3 html tables, with names "Table 1", "Table 2", and
>>> # the only one of interest with the data, "grdRelSitGeralProcessos"
>>> #
>>> # From each such table, extract the following columns:
>>> #  - Processo
>>> #  - Endereço
>>> #  - Distrito
>>> #  - Area terreno (m2)
>>> #  - Valor contrapartida ($)
>>> #  - Area excedente (m2)
>>>
>>> # NB: All of the numeric fields use "." as the thousands separator and ","
>>> # as the decimal separator, but because of this are read in as character
>>>
>>> library(XML)
>>> link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>>
>>> saopaulo <- htmlParse(link)
>>> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
>>> length(saopaulo.tables)
>>>
>>> # it's the third table on this page we want
>>> sp.tab <- saopaulo.tables[[3]]
>>>
>>> # columns wanted
>>> wanted <- c(1, 2, 5, 7, 8, 13, 14)
>>> head(sp.tab[, wanted])
>>>
>>>> head(sp.tab[, wanted])
>>>   Proposta         Processo                                            Endereço        Distrito
>>> 1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO VAN CLEVE       VILA ANDRADE
>>> 2        2 2003-0.129.667-3                      AV. DR. JOSÉ HIGINO, 200 E 216          AGUA RASA
>>> 3        3 2003-0.065.011-2                       R. ALIANÇA LIBERAL, 980 E 990    VILA LEOPOLDINA
>>> 4        4 2003-0.165.806-0                       R. ALIANÇA LIBERAL, 880 E 886    VILA LEOPOLDINA
>>> 5        5 2003-0.139.053-0                R. DR. JOSÉ DE ANDRADE FIGUEIRA, 111       VILA ANDRADE
>>> 6        6 2003-0.200.692-0                                R. JOSÉ DE JESUS, 66         VILA SONIA
>>>   Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
>>> 1              0,00            1.551,14               127.875,98
>>> 2              0,00            3.552,13               267.075,77
>>> 3              0,00              624,99                70.212,93
>>> 4              0,00              395,64                44.447,18
>>> 5              0,00              719,68                41.764,46
>>> 6              0,00              446,52                85.152,92
>>>
>>> thanks,
>>>
>>>
>>> --
>>> Michael Friendly     Email: friendly AT yorku DOT ca
>>> Professor, Psychology Dept. & Chair, Quantitative Methods
>>> York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
>>> 4700 Keele Street    Web: http://www.datavis.ca
>>> Toronto, ONT M3J 1P3 CANADA
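On the number formatting noted above: once a table is in hand, the "1.551,14"-style character columns can be converted after the fact. A small helper along these lines should do it (the column names are taken from the sample output and may differ slightly in the live table):

# "1.551,14" -> 1551.14 : drop the thousands "." first, then turn "," into "."
to_num <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

# sp.tab as created by the readHTMLTable() code above
num_cols <- c("Área Terreno (m2)", "Área Excedente (m2)", "Valor Contrapartida (R$)")
sp.tab[num_cols] <- lapply(sp.tab[num_cols], to_num)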
>
> David Winsemius
> Alameda, CA, USA
>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
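One way to get the intermediate results David asks about is to let each iteration fail on its own, so a bad page yields NULL instead of aborting the whole run. A sketch, assuming the same remDr session and the same page-navigation logic as the code quoted above (n_pages is hard-coded here, but it could also be read off the "Página 1 de NN" label first):

library(RSelenium)
library(rvest)
library(dplyr)
library(pbapply)

n_pages <- 69   # or detect it from the pager text instead of hard-coding

get_page_table <- function(remDr) {
  pg   <- read_html(remDr$getPageSource()[[1]])
  tbls <- html_nodes(pg, "table")
  if (length(tbls) < 3) return(NULL)   # page didn't render the data table
  html_table(tbls[[3]], header = TRUE)
}

tabs <- pblapply(1:n_pages, function(i) {
  tryCatch({
    # pages 1, 11, 21, ... are already showing; the rest need a link click
    if (!(i %in% seq(1, n_pages, 10))) {
      ref <- remDr$findElements("xpath",
        sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
      ref[[1]]$clickElement()
    }
    ret <- get_page_table(remDr)
    # after every 10th link, click "..." to fetch the next block of pages
    if ((i %% 10) == 0) {
      ref <- remDr$findElements("xpath", ".//a[.='...']")
      ref[[length(ref)]]$clickElement()
    }
    ret
  }, error = function(e) NULL)   # a failed page becomes NULL and the run continues
})

# `tabs` survives even if some pages failed; drop the NULLs and bind the rest
final_dat <- bind_rows(tabs[!vapply(tabs, is.null, logical(1))])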