On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> > > <spencer.gra...@effectivedefense.org> wrote:
> > > > Hello, All:
> > > > 
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > > 
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > > 
> > > Hi Spencer,
> > > 
> > > I tried the code below on an older R-installation, and it works
> > > fine.  Not a full solution, but it's a start:
> > > 
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- 
> > > > "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975";
> > > > M_sos <- getURL(url)
> > 
> > Hi Bill et al.:
> > 
> > That broke the dam:  It gave me a character vector of length 1
> > consisting of 218 KB.  I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter.  The columns of the former are all character;  that's not
> > true for the latter.
> > 
> > Sadly, it's not quite what I want:  It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> 
> Maybe try something like this:
> 
> url <- 
> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975";
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the 
first summary table like this:

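        # Download the page once and cache the raw HTML locally.
        url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
        M_sos <- RCurl::getURL(url)
        saveRDS(object=M_sos, file="dcp.rds")
        # Parse every table on the page; the first one is the summary.
        dat <- XML::readHTMLTable(M_sos)
        idx <- 2:length(dat)
        # Union of the column names found in the party tables.
        cn <- unique(unlist(lapply(dat[idx], colnames)))
        dat <- do.call(rbind,
          lapply(idx, function(i, dat, cn) {
            x <- dat[[i]]
            # Add any missing columns as NA so all tables line up.
            x[,cn[!(cn %in% colnames(x))]] <- NA
            x <- x[,cn]
            # Keep the table name as a new column.
            x$Party <- names(dat)[i]
            x
          }, dat=dat, cn=cn))
        # Convert the filing date from text (month/day/year) to Date.
        dat[,"Date Filed"] <-
          as.Date(x=dat[,"Date Filed"],
                  format="%m/%d/%Y")
        write.table(dat, file="dcp.tsv", sep="\t",
                    row.names=FALSE,
                    quote=TRUE, na="N/A")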
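
For anyone who wants to try the padding-and-stacking step without
hitting the website, here is a minimal offline sketch in base R. The
toy tables, their column names and values, and the helper name
bind_party_tables are all invented for illustration; the real page's
tables will differ, so treat this as a sketch of the idea rather than
a drop-in replacement:

```r
# Toy stand-ins for the list returned by XML::readHTMLTable().
# All names and values here are made up for illustration.
tables <- list(
  Summary    = data.frame(Office = "Governor", Count = 2,
                          stringsAsFactors = FALSE),
  Republican = data.frame(Name = c("A. Smith", "B. Jones"),
                          `Date Filed` = c("2/25/2020", "3/3/2020"),
                          check.names = FALSE, stringsAsFactors = FALSE),
  Democratic = data.frame(Name = "C. Brown",
                          `Mailing Address` = "PO Box 1",
                          `Date Filed` = "2/26/2020",
                          check.names = FALSE, stringsAsFactors = FALSE)
)

# Hypothetical helper: drop the first `skip` tables, pad the rest to a
# common set of columns, tag each row with its table name, and stack.
bind_party_tables <- function(tables, skip = 1) {
  keep <- tables[-seq_len(skip)]
  cn <- unique(unlist(lapply(keep, colnames)))
  padded <- lapply(names(keep), function(party) {
    x <- keep[[party]]
    # Columns this table lacks are filled with NA.
    for (m in setdiff(cn, colnames(x))) x[[m]] <- NA
    x <- x[, cn, drop = FALSE]
    x$Party <- party
    x
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL
  out
}

combined <- bind_party_tables(tables)
# Dates arrive as month/day/year text; convert them to Date.
combined[["Date Filed"]] <- as.Date(combined[["Date Filed"]],
                                    format = "%m/%d/%Y")
print(combined)
```

Looping over names(keep) instead of integer positions keeps the table
name and its data together, which may be convenient if the office
designation is later recovered and stored the same way.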

Best,
Rasmus

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.