On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > Dear Spencer,
> >
> > I unified the party tables after the
> > first summary table like this:
> >
> >     url <-
> >       "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >     M_sos <- RCurl::getURL(url)
> >     saveRDS(object=M_sos, file="dcp.rds")
> >     dat <- XML::readHTMLTable(M_sos)
> >     idx <- 2:length(dat)
> >     cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> >     dat <- do.call(rbind,
> >       sapply(idx, function(i, dat, cn) {
> >         x <- dat[[i]]
> >         x[,cn[!(cn %in% colnames(x))]] <- NA
> >         x <- x[,cn]
> >         x$Party <- names(dat)[i]
> >         return(list(x))
> >       }, dat=dat, cn=cn))
> >     dat[,"Date Filed"] <-
> >       as.Date(x=dat[,"Date Filed"],
> >               format="%m/%d/%Y")
>
> This misses something extremely
> important for this application: The
> political office. That's buried in
> the HTML or whatever it is. I'm using
> something like the following to find
> that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not
very elegant.

Instead of showing you the solution and
hoping you understand everything in it,
I want to give you some emphatic hints
to see if you can come up with a
solution on your own.

- XML::htmlTreeParse(M_sos)
- *Gandalf voice*: climb the tree
  until you find the content you are
  looking for, flat out at the level of
  «The Children of the Div», *uuuUUU*
- You only want to keep the table and
  header tags at this level.
- Use XML::xmlValue to extract the
  values of all the headers (the
  political positions).
- Observe that all the tables you were
  able to extract previously using
  XML::readHTMLTable are at this level,
  interleaved with the political
  position header tags. This means you
  can extract the political position
  and party affiliation by using a for
  loop, if statements, typeof, names,
  and [] and [[]] to grab different
  things from the list (the content or
  the bag itself).

XML::readHTMLTable strips the line break
tags from the Mailing Address column, so
if you find a better way of extracting
the tables, tell me. E.g. you get

    8805 HUNTER AVEKANSAS CITY MO 64138

and not

    8805 HUNTER AVE<br/>KANSAS CITY MO 64138

When you've completed this «programming
quest», you're back at the level of the
previous email, i.e. you have the same
tables, but with political position and
party affiliation added to them.

Best,
Rasmus
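
For readers following along, the hints above can be sketched roughly as below. This is one possible shape of the result, not the solution alluded to in the email. It runs on a small made-up HTML string rather than the real page, and several things are assumptions: the div id "candidates", the use of h3 for the office headers, the party being stored in each table's id attribute, and the helper name cell_text. The custom elFun is one way to keep the line breaks inside a cell that the default xmlValue would otherwise run together.

```r
library(XML)

# Made-up stand-in for the real page: office headers (h3) interleaved with
# one table per party. The div id, the h3 tag and the table id attributes
# are assumptions about the markup, not taken from the actual site.
html <- paste0(
  '<html><body><div id="candidates">',
  '<h3>Governor</h3>',
  '<table id="Republican">',
  '<tr><th>Name</th><th>Mailing Address</th></tr>',
  '<tr><td>A. EXAMPLE</td><td>1 EXAMPLE ST<br/>ANYTOWN MO 00000</td></tr>',
  '</table>',
  '<table id="Democratic">',
  '<tr><th>Name</th><th>Mailing Address</th></tr>',
  '<tr><td>B. EXAMPLE</td><td>2 EXAMPLE AVE<br/>ANYTOWN MO 00000</td></tr>',
  '</table>',
  '<h3>Lieutenant Governor</h3>',
  '<table id="Libertarian">',
  '<tr><th>Name</th><th>Mailing Address</th></tr>',
  '<tr><td>C. EXAMPLE</td><td>3 EXAMPLE RD<br/>ANYTOWN MO 00000</td></tr>',
  '</table>',
  '</div></body></html>')

doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# Keep only the header and table children of the div, in document order.
kids <- getNodeSet(doc, "//div[@id='candidates']/*[self::h3 or self::table]")

# Hypothetical helper: join the text pieces of a cell with newlines, so that
# text on either side of a <br/> stays on separate lines.
cell_text <- function(node, ...) {
  parts <- trimws(unlist(xpathSApply(node, ".//text()", xmlValue)))
  paste(parts[nzchar(parts)], collapse = "\n")
}

office <- NA_character_
pieces <- list()
for (i in seq_along(kids)) {
  node <- kids[[i]]
  if (xmlName(node) == "h3") {
    # A header starts a new office; remember it for the tables that follow.
    office <- trimws(xmlValue(node))
  } else {
    tab <- readHTMLTable(node, stringsAsFactors = FALSE, elFun = cell_text)
    tab$Party <- xmlGetAttr(node, "id", NA_character_)
    tab$Office <- office
    pieces[[length(pieces) + 1]] <- tab
  }
}

dat <- do.call(rbind, pieces)
print(dat)
```

On the real page the party tables do not all share the same columns, so the column-unification step from the previous email (the cn vector) would still be needed before the rbind.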
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.