Hey David,

I'm on a Mac as well but have never had to tweak anything to get [R]Selenium to work (though this is one reason I try to avoid solutions that involve RSelenium: they're pretty fragile, IMO).
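When the Java server is the culprit on a Mac, a quick reachability check before opening the driver usually narrows things down faster than the "Undefined error in RCurl call" message does. A minimal sketch, assuming the standalone server is meant to be listening on the default port 4444 (httr is only used here because it is already in the dependency chain):

library(RSelenium)
library(httr)

checkForServer()   # fetch the selenium-server-standalone jar if it isn't there
startServer()      # launch the Java server (needs a working `java` on the PATH)
Sys.sleep(2)       # give the JVM a moment before poking it

# if this GET fails, remoteDriver$open() will fail the same way, so it is a
# cheaper place to debug (port already in use, jar not found, Java problems, ...)
status <- try(GET("http://localhost:4444/wd/hub/status"), silent = TRUE)
if (inherits(status, "try-error") || status_code(status) != 200) {
  stop("Selenium server isn't answering on port 4444")
}

# being explicit about the defaults sometimes surfaces a clearer error, too
remDr <- remoteDriver$new(remoteServerAddr = "localhost", port = 4444L,
                          browserName = "firefox")
remDr$open()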
The site itself has "Página 1 de 69" at the top, which is where I got the "69" from, and I just re-ran the code in a 100% clean env (on a completely different Mac) and it worked fine. I did neglect to put my session info up before (apologies):

Session info ---------------------------------------------------------------
 setting  value
 version  R version 3.3.0 RC (2016-05-01 r70572)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.1172)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
 date     2016-05-11

Packages -------------------------------------------------------------------
 package    * version  date       source
 assertthat   0.1      2013-12-06 CRAN (R 3.3.0)
 bitops     * 1.0-6    2013-08-17 CRAN (R 3.3.0)
 caTools      1.17.1   2014-09-10 CRAN (R 3.3.0)
 DBI          0.4      2016-05-02 CRAN (R 3.3.0)
 devtools   * 1.11.1   2016-04-21 CRAN (R 3.3.0)
 digest       0.6.9    2016-01-08 CRAN (R 3.3.0)
 dplyr      * 0.4.3    2015-09-01 CRAN (R 3.3.0)
 httr         1.1.0    2016-01-28 CRAN (R 3.3.0)
 magrittr     1.5      2014-11-22 CRAN (R 3.3.0)
 memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)
 pbapply    * 1.2-1    2016-04-19 CRAN (R 3.3.0)
 R6           2.1.2    2016-01-26 CRAN (R 3.3.0)
 Rcpp         0.12.4   2016-03-26 CRAN (R 3.3.0)
 RCurl      * 1.95-4.8 2016-03-01 CRAN (R 3.3.0)
 RJSONIO    * 1.3-0    2014-07-28 CRAN (R 3.3.0)
 RSelenium  * 1.3.5    2014-10-26 CRAN (R 3.3.0)
 rvest      * 0.3.1    2015-11-11 CRAN (R 3.3.0)
 selectr      0.2-3    2014-12-24 CRAN (R 3.3.0)
 stringi      1.0-1    2015-10-22 CRAN (R 3.3.0)
 stringr      1.0.0    2015-04-30 CRAN (R 3.3.0)
 withr        1.0.1    2016-02-04 CRAN (R 3.3.0)
 XML        * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
 xml2       * 0.1.2    2015-09-01 CRAN (R 3.3.0)

(and, wow, does that tiny snippet of code end up using a lot of pkgs)

I had actually started with smaller snippets to test. The code got uglier due to the way the site paginates (it loads 10 entries' worth of data onto a single page but requires a server call for the next 10). I also keep Firefox scarily out of date (back in the 33-series) b/c I only use it with RSelenium (not a big fan of the browser). Let me update to the 46-series and see if I can replicate.

-Bob
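P.S. The "69" doesn't have to be hard-coded, either; the "Página 1 de 69" label can be read off the page once remDr$navigate(URL) has run. A rough sketch (the "span" selector is a guess and needs to be checked against the real page source):

library(rvest)

# assumes remDr has already navigated to URL, as in the code further down
pg <- read_html(remDr$getPageSource()[[1]])

# "span" is an assumption -- inspect the page to find the element that actually
# holds the "Página 1 de 69" text and adjust the selector to match
pager <- html_text(html_nodes(pg, "span"))
pager <- pager[grepl("P.gina\\s+\\d+\\s+de\\s+\\d+", pager)][1]  # "." dodges encoding quirks

n_pages <- as.integer(sub(".*de\\s+(\\d+).*", "\\1", pager))
n_pages   # use this in place of the literal 69 in both spots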
On Wed, May 11, 2016 at 1:48 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
>> On May 10, 2016, at 1:11 PM, boB Rudis <b...@rudis.net> wrote:
>>
>> Unfortunately, it's a wretched, vile, SharePoint-based site. That
>> means it doesn't use traditional encoding methods to do the pagination,
>> and one of the only ways to do this effectively is going to be to use
>> RSelenium:
>>
>> library(RSelenium)
>> library(rvest)
>> library(dplyr)
>> library(pbapply)
>>
>> URL <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>
>> checkForServer()
>> startServer()
>> remDr <- remoteDriver$new()
>> remDr$open()
>
> Thanks Bob/hrbrmstr;
>
> At this point I got an error:
>
>> startServer()
>> remDr <- remoteDriver$new()
>> remDr$open()
> [1] "Connecting to remote server"
> Undefined error in RCurl call.Error in queryRD(paste0(serverURL, "/session"),
>   "POST", qdata = toJSON(serverOpts)) :
>
> Running R 3.0.0 on a Mac (El Cap) in the R.app GUI.
>
> $ java -version
> java version "1.8.0_65"
> Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
>
> I asked myself: what additional information is needed to debug this? But then
> I thought I had a responsibility to search for earlier reports of this error
> on a Mac, and there were many.
>
> After reading this thread:
> https://github.com/ropensci/RSelenium/issues/54
> I decided to try creating an "alias", mac-speak for a symlink, and put that
> symlink in my working directory (with no further chmod security efforts). I
> restarted R and re-ran the code, which opened a Firefox browser window and
> then proceeded to page through many pages. Eventually, however, it errors out
> with this message:
>
>> pblapply(1:69, function(i) {
> +
> +   if (i %in% seq(1, 69, 10)) {
> +     pg <- read_html(remDr$getPageSource()[[1]])
> +     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +   } else {
> +     ref <- remDr$findElements("xpath",
> +       sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
> +     ref[[1]]$clickElement()
> +     pg <- read_html(remDr$getPageSource()[[1]])
> +     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
> +
> +   }
> +   if ((i %% 10) == 0) {
> +     ref <- remDr$findElements("xpath", ".//a[.='...']")
> +     ref[[length(ref)]]$clickElement()
> +   }
> +
> +   ret
> +
> + }) -> tabs
>   |+++++++++++                                        | 22% ~54s
> Error in html_nodes(pg, "table")[[3]] : subscript out of bounds
>>
>> final_dat <- bind_rows(tabs)
> Error in bind_rows(tabs) : object 'tabs' not found
>
> There doesn't seem to be any trace of objects from all the downloading
> efforts that I could find. When I changed both instances of '69' to '30' it
> no longer errors out. Is there supposed to be an initial step of finding out
> how many pages are actually there before setting the two iteration limits?
> I'm wondering if that code could be modified to return some intermediate
> values that would be amenable to further assembly efforts in the event of
> errors?
>
> Sincerely;
> David.
>
>
>> remDr$navigate(URL)
>>
>> pblapply(1:69, function(i) {
>>
>>   if (i %in% seq(1, 69, 10)) {
>>
>>     # the first item on the page is not a link but we can just grab the page
>>
>>     pg <- read_html(remDr$getPageSource()[[1]])
>>     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>   } else {
>>
>>     # we can get the rest of them by the link text directly
>>
>>     ref <- remDr$findElements("xpath",
>>       sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
>>     ref[[1]]$clickElement()
>>     pg <- read_html(remDr$getPageSource()[[1]])
>>     ret <- html_table(html_nodes(pg, "table")[[3]], header=TRUE)
>>
>>   }
>>
>>   # we have to move to the next actual page of data after every 10 links
>>
>>   if ((i %% 10) == 0) {
>>     ref <- remDr$findElements("xpath", ".//a[.='...']")
>>     ref[[length(ref)]]$clickElement()
>>   }
>>
>>   ret
>>
>> }) -> tabs
>>
>> final_dat <- bind_rows(tabs)
>> final_dat <- final_dat[, c(1, 2, 5, 7, 8, 13, 14)] # the cols you want
>> final_dat <- final_dat[complete.cases(final_dat),] # take care of NAs
>>
>> remDr$quit()
>>
>>
>> Probably good ref code to have around, but you can grab the data & code
>> here: https://gist.github.com/hrbrmstr/ec35ebb32c3cf0aba95f7bad28df1e98
>>
>> (anything to help a fellow parent out :-)
>>
>> -Bob
>>
>> On Tue, May 10, 2016 at 2:45 PM, Michael Friendly <frien...@yorku.ca> wrote:
>>> This is my first attempt to try R web scraping tools, for a project my
>>> daughter is working on. It concerns a database of projects in Sao Paulo,
>>> Brazil, listed at
>>> http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx,
>>> but spread out over 69 pages accessed through a javascript menu at the
>>> bottom of the page.
>>>
>>> Each web page contains 3 HTML tables, of which only the last contains
>>> the relevant data.
>>> In this, only a subset of the columns is of interest.
>>>
>>> I tried using the XML package as illustrated on several tutorial pages,
>>> as shown below. I have no idea how to automate this to extract these
>>> tables from multiple web pages. Is there some other package better
>>> suited to this task? Can someone help me solve this and other issues?
>>>
>>> # Goal: read the data tables contained on 69 pages generated by the link
>>> # below, where each page is generated by a javascript link in the menu at
>>> # the bottom of the page.
>>> #
>>> # Each "page" contains 3 html tables, with names "Table 1", "Table 2", and
>>> # the only one of interest with the data, "grdRelSitGeralProcessos"
>>> #
>>> # From each such table, extract the following columns:
>>> #  - Processo
>>> #  - Endereço
>>> #  - Distrito
>>> #  - Area terreno (m2)
>>> #  - Valor contrapartida ($)
>>> #  - Area excedente (m2)
>>>
>>> # NB: All of the numeric fields use "." as the thousands separator and ","
>>> # as the decimal separator, but because of this are read in as character
>>>
>>> library(XML)
>>> link <- "http://outorgaonerosa.prefeitura.sp.gov.br/relatorios/RelSituacaoGeralProcessos.aspx"
>>>
>>> saopaulo <- htmlParse(link)
>>> saopaulo.tables <- readHTMLTable(saopaulo, stringsAsFactors = FALSE)
>>> length(saopaulo.tables)
>>>
>>> # it's the third table on this page we want
>>> sp.tab <- saopaulo.tables[[3]]
>>>
>>> # columns wanted
>>> wanted <- c(1, 2, 5, 7, 8, 13, 14)
>>> head(sp.tab[, wanted])
>>>
>>>> head(sp.tab[, wanted])
>>>   Proposta         Processo                                            Endereço        Distrito
>>> 1        1 2002-0.148.242-4 R. DOMINGOS LOPES DA SILVA X R. CORNÉLIO VAN CLEVE       VILA ANDRADE
>>> 2        2 2003-0.129.667-3                      AV. DR. JOSÉ HIGINO, 200 E 216          AGUA RASA
>>> 3        3 2003-0.065.011-2                       R. ALIANÇA LIBERAL, 980 E 990    VILA LEOPOLDINA
>>> 4        4 2003-0.165.806-0                       R. ALIANÇA LIBERAL, 880 E 886    VILA LEOPOLDINA
>>> 5        5 2003-0.139.053-0                R. DR. JOSÉ DE ANDRADE FIGUEIRA, 111       VILA ANDRADE
>>> 6        6 2003-0.200.692-0                                R. JOSÉ DE JESUS, 66         VILA SONIA
>>>   Área Terreno (m2) Área Excedente (m2) Valor Contrapartida (R$)
>>> 1              0,00            1.551,14               127.875,98
>>> 2              0,00            3.552,13               267.075,77
>>> 3              0,00              624,99                70.212,93
>>> 4              0,00              395,64                44.447,18
>>> 5              0,00              719,68                41.764,46
>>> 6              0,00              446,52                85.152,92
>>>
>>> thanks,
>>>
>>>
>>> --
>>> Michael Friendly     Email: friendly AT yorku DOT ca
>>> Professor, Psychology Dept. & Chair, Quantitative Methods
>>> York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
>>> 4700 Keele Street    Web: http://www.datavis.ca
>>> Toronto, ONT M3J 1P3 CANADA
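On the number formatting noted above: once a table is in hand, the "1.551,14"-style character columns can be converted after the fact. A small helper along these lines should do it (the column names are taken from the sample output and may differ slightly in the live table):

# "1.551,14" -> 1551.14 : drop the thousands "." first, then turn "," into "."
to_num <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

# sp.tab as created by the readHTMLTable() code above
num_cols <- c("Área Terreno (m2)", "Área Excedente (m2)", "Valor Contrapartida (R$)")
sp.tab[num_cols] <- lapply(sp.tab[num_cols], to_num)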
>
> David Winsemius
> Alameda, CA, USA
>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
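One way to get the intermediate results David asks about is to let each iteration fail on its own, so a bad page yields NULL instead of aborting the whole run. A sketch, assuming the same remDr session and the same page-navigation logic as the code quoted above (n_pages is hard-coded here, but it could also be read off the "Página 1 de NN" label first):

library(RSelenium)
library(rvest)
library(dplyr)
library(pbapply)

n_pages <- 69   # or detect it from the pager text instead of hard-coding

get_page_table <- function(remDr) {
  pg   <- read_html(remDr$getPageSource()[[1]])
  tbls <- html_nodes(pg, "table")
  if (length(tbls) < 3) return(NULL)   # page didn't render the data table
  html_table(tbls[[3]], header = TRUE)
}

tabs <- pblapply(1:n_pages, function(i) {
  tryCatch({
    # pages 1, 11, 21, ... are already showing; the rest need a link click
    if (!(i %in% seq(1, n_pages, 10))) {
      ref <- remDr$findElements("xpath",
        sprintf(".//a[contains(@href, 'javascript:__doPostBack') and .='%s']", i))
      ref[[1]]$clickElement()
    }
    ret <- get_page_table(remDr)
    # after every 10th link, click "..." to fetch the next block of pages
    if ((i %% 10) == 0) {
      ref <- remDr$findElements("xpath", ".//a[.='...']")
      ref[[length(ref)]]$clickElement()
    }
    ret
  }, error = function(e) NULL)   # a failed page becomes NULL and the run continues
})

# `tabs` survives even if some pages failed; drop the NULLs and bind the rest
final_dat <- bind_rows(tabs[!vapply(tabs, is.null, logical(1))])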