On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> > > <spencer.gra...@effectivedefense.org> wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple
> > > > attempts to scrape the table of
> > > > candidates from the website of
> > > > the Missouri Secretary of
> > > > State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > >
> > > Hi Spencer,
> > >
> > > I tried the code below on an older
> > > R-installation, and it works fine.
> > > Not a full solution, but it's a
> > > start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <-
> > > > "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> >
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a
> > character vector of length 1
> > consisting of 218 KB. I fed that to
> > XML::readHTMLTable and
> > purrr::map_chr, both of which
> > returned lists of 337 data.frames.
> > The former retained names for all
> > the tables, absent from the latter.
> > The columns of the former are all
> > character; that's not true for the
> > latter.
> >
> > Sadly, it's not quite what I want:
> > It's one table for each office-party
> > combination, but it's lost the
> > office designation. However, I'm
> > confident I can figure out how to
> > hack that.
>
> Maybe try something like this:
>
> url <-
> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the first summary table like this:

    url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
    M_sos <- RCurl::getURL(url)
    saveRDS(object=M_sos, file="dcp.rds")  # keep a local copy of the raw page
    dat <- XML::readHTMLTable(M_sos)
    idx <- 2:length(dat)                   # skip the first (summary) table
    # union of the column names found in the remaining tables
    cn <- unique(unlist(lapply(dat[idx], colnames)))
    dat <- do.call(rbind,
      sapply(idx, function(i, dat, cn) {
        x <- dat[[i]]
        x[,cn[!(cn %in% colnames(x))]] <- NA  # add the columns this table lacks
        x <- x[,cn]                           # same column order in every table
        x$Party <- names(dat)[i]              # name of the table each row came from
        return(list(x))
      }, dat=dat, cn=cn))
    dat[,"Date Filed"] <-
      as.Date(x=dat[,"Date Filed"],
              format="%m/%d/%Y")
    write.table(dat, file="dcp.tsv", sep="\t",
                row.names=FALSE,
                quote=TRUE, na="N/A")

Best,
Rasmus
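
For anyone who wants to try the column-unifying step without hitting the website, here is a minimal sketch of the same idea on made-up tables. The table names, candidate names, and dates are hypothetical stand-ins for what XML::readHTMLTable returns, and the sketch uses lapply rather than sapply purely as a matter of taste; it is not meant as a replacement for the code above.

```r
# Toy stand-in for the list returned by XML::readHTMLTable():
# a summary table first, then one table per party (all values hypothetical).
dat <- list(
  Summary = data.frame(Office = "Governor", Candidates = "3",
                       stringsAsFactors = FALSE),
  Republican = data.frame(Name = c("A. Smith", "B. Jones"),
                          `Date Filed` = c("2/25/2020", "3/3/2020"),
                          check.names = FALSE, stringsAsFactors = FALSE),
  Democratic = data.frame(Name = "C. Brown",
                          `Mailing Address` = "1 Main St",
                          `Date Filed` = "2/26/2020",
                          check.names = FALSE, stringsAsFactors = FALSE)
)

idx <- 2:length(dat)                              # skip the summary table
cn <- unique(unlist(lapply(dat[idx], colnames)))  # union of all column names

unified <- do.call(rbind, lapply(idx, function(i) {
  x <- dat[[i]]
  # add every column this table lacks, filled with missing values
  for (m in setdiff(cn, colnames(x))) x[[m]] <- NA_character_
  x <- x[, cn, drop = FALSE]   # same column order in every table
  x$Party <- names(dat)[i]     # remember which table each row came from
  x
}))

# convert the filing date from text to a Date
unified[, "Date Filed"] <- as.Date(unified[, "Date Filed"], format = "%m/%d/%Y")
print(unified)
```

Padding the missing columns before calling rbind matters because rbind on data frames expects the same set of column names in every piece.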
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.