Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>" tags, appending a "___" marker at each instance of "<br/>" (first I checked the text to make sure it didn't contain any pre-existing instances of "___"). See the output snippet below:
> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                   Name                          Mailing Address Random Number Date Filed
1        Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2           Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3  James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4      Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>

It's true, there's a 'section' of MOcan output that contains odd-looking characters (see the "Total" line of MOcan[[1]]). But my guess is you'll be deleting this 'line' anyway--and recalculating totals in R.

Now that you have a comprehensive list object, you should be able to pull out districts/races of interest. You might want to take a look at the "rlist" package, to see if it can make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html

HTH, Bill.

W. Michels, Ph.D.

On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves <spencer.gra...@effectivedefense.org> wrote:
>
> Dear Rasmus et al.:
>
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the
> >>> first summary table like this:
> >>>
> >>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>> M_sos <- RCurl::getURL(url)
> >>> saveRDS(object=M_sos, file="dcp.rds")
> >>> dat <- XML::readHTMLTable(M_sos)
> >>> idx <- 2:length(dat)
> >>> cn <- unique(unlist(lapply(dat[idx], colnames)))
> >>
> >> This is useful for this application.
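[A quick follow-up on the snippet above: once the tables are in hand, the "___" markers can be split back out into separate address lines with strsplit(). A minimal sketch; the sample address is copied from the MOcan[[2]] output above:]

```r
# Split a combined mailing address back into its two lines, using the
# "___" marker that gsub() inserted at each "<br/>".
addr <- "4476 FIVE MILE RD___SENECA MO 64865"
parts <- strsplit(addr, "___", fixed = TRUE)[[1]]
street <- parts[1]  # "4476 FIVE MILE RD"
city   <- parts[2]  # "SENECA MO 64865"
```

[The same call vectorizes over the whole "Mailing Address" column if you run strsplit() on it directly.]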
> >>
> >>> dat <- do.call(rbind,
> >>>   sapply(idx, function(i, dat, cn) {
> >>>     x <- dat[[i]]
> >>>     x[,cn[!(cn %in% colnames(x))]] <- NA
> >>>     x <- x[,cn]
> >>>     x$Party <- names(dat)[i]
> >>>     return(list(x))
> >>>   }, dat=dat, cn=cn))
> >>> dat[,"Date Filed"] <-
> >>>   as.Date(x=dat[,"Date Filed"],
> >>>           format="%m/%d/%Y")
> >>
> >> This misses something extremely
> >> important for this application: the
> >> political office. That's buried in
> >> the HTML or whatever it is. I'm using
> >> something like the following to find
> >> that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> >
> > Dear Spencer,
> >
> > I came up with a solution, but it is not
> > very elegant. Instead of showing you
> > the solution, hoping you understand
> > everything in it, I instead want to give
> > you some emphatic hints to see if you
> > can come up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> > - *Gandalf voice*: climb the tree
> >   until you find the content you are
> >   looking for flat out at the level of
> >   «The Children of the Div», *uuuUUU*
> > - you only want to keep the table and
> >   header tags at this level
> > - Use XML::xmlValue to extract the
> >   values of all the headers (the
> >   political positions)
> > - Observe that all the tables on the
> >   page you were able to extract
> >   previously using XML::readHTMLTable
> >   are at this level, shuffled between
> >   the political position header tags;
> >   this means you extract the political
> >   position and party affiliation by
> >   using a for loop, if statements,
> >   typeof, names, and [] and [[]] to grab
> >   different things from the list
> >   (content or the bag itself).
> >   XML::readHTMLTable strips away the
> >   line break tags from the Mailing
> >   address, so if you find a better way
> >   of extracting the tables, tell me, e.g.
> > you get
> >
> >   8805 HUNTER AVEKANSAS CITY MO 64138
> >
> > and not
> >
> >   8805 HUNTER AVE<br/>KANSAS CITY MO 64138
> >
> > When you've completed this «programming
> > quest», you're back at the level of the
> > previous email, i.e. you have the
> > same tables, but with political position
> > and party affiliation added to them.
>
> Please excuse: Before my last post, I had written code to do all
> that. In brief, the political offices are "h3" tags. I used "strsplit"
> to split the string at "<h3>". I then wrote a function to find "</h3>",
> extract the political office, and pass the rest to "XML::readHTMLTable",
> adding columns for party and political office.
>
> However, this suppressed "<br/>" everywhere. I thought there
> should be an option with something like "XML::readHTMLTable" that would
> not delete "<br/>" everywhere, but I couldn't find it. If you aren't
> aware of one, I can gsub("<br/>", "\n", ...) on the string for each
> political office before passing it to "XML::readHTMLTable". I just
> tested this: it works.
>
> I have other web scraping problems in my work plan for the next few
> days. I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
> Thanks again.
> Spencer Graves
>
> > Best,
> > Rasmus
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
> [[alternative HTML version deleted]]
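[For what it's worth, the h3-splitting approach Spencer describes can be sketched roughly as below. This is untested against the live Secretary of State page; the toy HTML string and the candidate names in it are made up purely for illustration:]

```r
library(XML)

# Toy stand-in for the downloaded page: <h3> office headers, each
# followed by a candidate table (hypothetical data, not the real page).
M_sos <- paste0(
  "<html><body>",
  "<h3>Governor</h3>",
  "<table><tr><th>Name</th></tr><tr><td>A. Candidate</td></tr></table>",
  "<h3>Lieutenant Governor</h3>",
  "<table><tr><th>Name</th></tr><tr><td>B. Candidate</td></tr></table>",
  "</body></html>")

# Split at each <h3>, peel the office name off before the "</h3>",
# parse the rest of the chunk with readHTMLTable(), and tag every
# table with its office.  (Party tagging would work the same way.)
chunks <- strsplit(M_sos, "<h3>", fixed = TRUE)[[1]][-1]
byOffice <- lapply(chunks, function(ch) {
  cut    <- regexpr("</h3>", ch, fixed = TRUE)
  office <- substr(ch, 1, cut - 1)
  body   <- substring(ch, cut + 5)  # drop the "</h3>" itself
  tabs   <- readHTMLTable(body, stringsAsFactors = FALSE)
  lapply(tabs, function(x) cbind(x, Office = office))
})
```

[Running gsub("<br/>", "\n", body, fixed = TRUE) just before the readHTMLTable() call is where the address line breaks would be preserved, as discussed above.]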