Re: [R] [External] Re: help with web scraping
Dear William Michels,

On 2020-07-25 10:58 -0700, William Michels wrote:
> Dear Spencer Graves (and Rasmus Liland),
>
> I've had some luck just using gsub() to alter the offending "<br/>"
> characters, appending a "___" tag at each instance of "<br/>" (first
> I checked the text to make sure it didn't contain any pre-existing
> instances of "___"). See the output snippet below:
>
> > library(RCurl)
> > library(XML)
> > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > sosChars <- getURL(sosURL)
> > sosChars2 <- gsub("<br/>", "___", sosChars)
> > MOcan <- readHTMLTable(sosChars2)
> > MOcan[[2]]
>                   Name                          Mailing Address Random Number Date Filed
> 1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
> 2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
> 3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
> 4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>
> It's true, there's a 'section' of MOcan output that contains
> odd-looking characters (see the "Total" line of MOcan[[1]]). But my
> guess is you'll be deleting this 'line' anyway--and recalculating
> totals in R.

Perhaps this is the table you mean?

                Offices Republican Democratic Libertarian    Green Constitution      Total
1              Governor          4          5           1        1            0         11
2   Lieutenant Governor          4          2           1        1            0          8
3    Secretary of State          1          1           1        1            1          5
4       State Treasurer          1          1           1        1            0          4
5      Attorney General          1          2           1        0            0          4
6   U.S. Representative         24         16           9        0            0         49
7         State Senator         28         22           2        1            0         53
8  State Representative        187        137           6        2            1        333
9         Circuit Judge         18          1           0        0            0         19
10                Total 268\r\n___ 187\r\n___   22\r\n___ 7\r\n___     2\r\n___ 486\r\n___

Yes, somehow the Windows[1] character "0xD" gets converted to "\r\n" after your gsub, while "<br/>" is still ignored. There is no "0xD" inside the td.AddressCol cells in the tables we are interested in.
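The "\r\n___" residue in the "Total" row above can be stripped with one more base-R pass before recalculating totals; a minimal sketch (the cell value is copied from the output above, everything else is illustrative):

```r
# A 'Total' cell as returned after the "___" substitution:
total_cell <- "268\r\n___"

# Strip the carriage-return/newline pair and the "___" marker,
# then trim any remaining whitespace:
clean <- trimws(gsub("\r\n|___", "", total_cell))

clean              # "268"
as.numeric(clean)  # 268
```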
> Now that you have a comprehensive list object, you should be able to
> pull out districts/races of interest. You might want to take a look
> at the "rlist" package, to see if it can make your work a little
> easier:
>
> https://CRAN.R-project.org/package=rlist
> https://renkun-ken.github.io/rlist-tutorial/index.html

Thank you, this package seems useful. Can you please provide a hint (maybe) as to which of the many functions you were thinking of? E.g. instead of using a for loop over the index of the list of headers and tables, checking if typeof is list or character, and updating variables to write the political position into each table.

V r

[1] https://stackoverflow.com/questions/5843495/what-does-m-character-mean-in-vim

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
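One possible shape for that for/typeof loop, sketched in base R with a toy list standing in for the parsed page (the offices and names here are made up, not from the Missouri site; in rlist terms, list.filter() and list.rbind() look like the closest candidates, though that is only a guess):

```r
# Toy stand-in for the parsed page: office headers (character) and
# candidate tables (data frames), interleaved in one list.
page <- list(
  "Governor",
  data.frame(Name = c("A", "B")),
  "State Auditor",
  data.frame(Name = "C")
)

# Walk the list once, carrying the most recent header forward and
# stamping it onto each table that follows it.
office <- NA_character_
tables <- list()
for (x in page) {
  if (is.character(x)) {
    office <- x                          # remember the current office
  } else if (is.data.frame(x)) {
    x$Office <- office                   # stamp it onto the table
    tables[[length(tables) + 1L]] <- x
  }
}
combined <- do.call(rbind, tables)
combined$Office  # "Governor" "Governor" "State Auditor"
```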
Re: [R] [External] Re: help with web scraping
Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:
> Dear Rasmus Liland et al.:
>
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> >
> > It is LILAND et al., is it not? ... else it's customary to put a
> > comma in there, isn't it? ...
>
> The APA Style recommends "Sharp et al., 2007":
>
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

If "Sharp et al., 2007" is an APA citation of this book[*], Sharp is John A Sharp's surname, and Liland is my surname. Q.E.D. I have not used APA before (as I am not a psychiatrist), as the minimalism of IEEE[**] always seemed more desirable.

> Regarding Confucius, I'm confused.

Never mind, just fooling around, that's all.

> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > >
> > > However, this suppressed "<br/>" everywhere.
> >
> > Why is that, please explain.
>
> I don't know why the Missouri Secretary of State's web site includes
> "<br/>" to signal a new line, but it does.

Me neither! On top of that, "<br/>" is actually[***] an XHTML tag, not an HTML tag.

> I also don't know why XML::readHTMLTable suppressed "<br/>"
> everywhere it occurred, but it did that.

Yes, I know, I also observed this. But now we swiftly solved this by gsubbing it with the newline char, "\n", which does not make sense to HTML parsers anyway.

> > > If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> > > string for each political office before passing it to
> > > "XML::readHTMLTable". I just tested this: It works.
> >
> > Such a great hack! IMHO, this is much more flexible than using
> > xml2::read_html, rvest::html_table, dplyr::mutate like here[1]
> >
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
>
> And I added my solution to this problem to this Stackoverflow thread.
I wish you many upvotes; alas, the political competition is obviously not tough there, as the other guy just got one downvote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br
Re: [R] [External] Re: help with web scraping
Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>" characters, appending a "___" tag at each instance of "<br/>" (first I checked the text to make sure it didn't contain any pre-existing instances of "___"). See the output snippet below:

> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                          Mailing Address Random Number Date Filed
1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020

It's true, there's a 'section' of MOcan output that contains odd-looking characters (see the "Total" line of MOcan[[1]]). But my guess is you'll be deleting this 'line' anyway--and recalculating totals in R.

Now that you have a comprehensive list object, you should be able to pull out districts/races of interest. You might want to take a look at the "rlist" package, to see if it can make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html

HTH, Bill.

W. Michels, Ph.D.
On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves wrote:
>
> Dear Rasmus et al.:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the first summary table like this:
> >>>
> >>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>> M_sos <- RCurl::getURL(url)
> >>> saveRDS(object=M_sos, file="dcp.rds")
> >>> dat <- XML::readHTMLTable(M_sos)
> >>> idx <- 2:length(dat)
> >>> cn <- unique(unlist(lapply(dat[idx], colnames)))
> >>
> >> This is useful for this application.
> >>
> >>> dat <- do.call(rbind,
> >>>   sapply(idx, function(i, dat, cn) {
> >>>     x <- dat[[i]]
> >>>     x[,cn[!(cn %in% colnames(x))]] <- NA
> >>>     x <- x[,cn]
> >>>     x$Party <- names(dat)[i]
> >>>     return(list(x))
> >>>   }, dat=dat, cn=cn))
> >>> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
> >>
> >> This misses something extremely important for this application: The
> >> political office. That's buried in the HTML or whatever it is. I'm
> >> using something like the following to find that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> >
> > Dear Spencer,
> >
> > I came up with a solution, but it is not very elegant. Instead of
> > showing you the solution, hoping you understand everything in it, I
> > instead want to give you some emphatic hints to see if you can come
> > up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> >   - *Gandalf voice*: climb the tree until you find the content you
> >     are looking for flat out at the level of «The Children of the
> >     Div», *uuuUUU*
> >   - you only want to keep the table and header tags at this level
> > - Use XML::xmlValue to extract the values of all the headers (the
> >   political positions)
> > - Observe that all the tables on the page you were able to extract
> >   previously using XML::readHTMLTable are at this level, shuffled
> >   between the political position header tags; this means you extract
> >   the political position and party affiliation by using a for loop,
> >   if statements, typeof, names, and [] and [[]] to grab different
> >   things from the list (content or the bag itself).
> >   XML::readHTMLTable strips away the line break tags from the
> >   Mailing address, so if you find a better way of extracting the
> >   tables, tell me, e.g. you get
> >
> >     8805 HUNTER AVE
> >     KANSAS CITY MO 64138
> >
> >   and not
> >
> >     8805 HUNTER AVEKANSAS CITY MO 64138
> >
> > When you've completed this «programming quest», you're back at the
> > level of the previous email, i.e. you have the same tables, but
> > with political position and party affiliation added to them.
>
> Please excuse: Before my last post, I had written code to do all
> that. In brief, the political offices are "h3" tags. I used
> "strsplit" to split the string at "<h3". I then wrote a function to
> find "</h3>", extract the political office and pass the rest to
> "XML::readHTMLTable", adding columns for party and political office.
>
> However, this suppressed "<br/>" everywhere. I thought there should
> be an
Re: [R] [External] Re: help with web scraping
Dear Rasmus Liland et al.:

On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > Dear Rasmus et al.:
>
> It is LILAND et al., is it not? ... else it's customary to put a
> comma in there, isn't it? ...

The APA Style recommends "Sharp et al., 2007":

https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

Regarding Confucius, I'm confused.

> right, moving on:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
>
> Please research using Thunderbird, Claws mail, or some other sane
> e-mail client; they are great, I promise.

Thanks. I researched it and turned off HTML. Please excuse: I noticed it was a problem, but hadn't prioritized time to research and fix it until your comment. Thanks.

> > Please excuse: Before my last post, I had written code to do all
> > that.
>
> Good!
>
> > In brief, the political offices are "h3" tags.
>
> Yes, some type of header element at least, in-between the various
> tables, everything children of the div in the element tree.
>
> > I used "strsplit" to split the string at "<h3". I then wrote a
> > function to find "</h3>", extract the political office and pass the
> > rest to "XML::readHTMLTable", adding columns for party and
> > political office.
>
> Yes, doing that for the political office is also possible, but the
> party is inside the table's caption tag, which ends up as the name of
> the table in the XML::readHTMLTable list ...
>
> > However, this suppressed "<br/>" everywhere.
>
> Why is that, please explain.

I don't know why the Missouri Secretary of State's web site includes "<br/>" to signal a new line, but it does. I also don't know why XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did that. After I used gsub to replace "<br/>" with "\n", I found that XML::readHTMLTable did not replace "\n", so I got what I wanted.

> > I thought there should be an option with something like
> > "XML::readHTMLTable" that would not delete "<br/>" everywhere, but
> > I couldn't find it.
>
> No, there is not, AFAIK.
> Please, if anyone else knows, please say so *echoes in the forest*
>
> > If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> > string for each political office before passing it to
> > "XML::readHTMLTable". I just tested this: It works.
>
> Such a great hack! IMHO, this is much more flexible than using
> xml2::read_html, rvest::html_table, dplyr::mutate like here[1]
>
> > I have other web scraping problems in my work plan for the next few
> > days.
>
> Maybe, idk ...
>
> > I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
> I wish you good luck,
> Rasmus
>
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells

And I added my solution to this problem to this Stackoverflow thread.

Thanks again,
Spencer
Re: [R] [External] Re: help with web scraping
On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not? I do not belong to a large Confucian family structure (putting the hunter-gatherer horse-rider tribe name first in all-caps in the email), else it's customary to put a comma in there, isn't it? ... right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
> > ?

It might be a better idea to write the reply in plain-text utf-8 or at least Western or Eastern-European ISO euro encoding instead of us-ascii (maybe KOI8, ¯\_(ツ)_/¯) ... something in your email got string-replaced by "?" and also "«" got replaced by "?". Please research using Thunderbird, Claws mail, or some other sane e-mail client; they are great, I promise.

> Please excuse: Before my last post, I had written code to do all
> that.

Good!

> In brief, the political offices are "h3" tags.

Yes, some type of header element at least, in-between the various tables, everything children of the div in the element tree.

> I used "strsplit" to split the string at "<h3". I then wrote a
> function to find "</h3>", extract the political office and pass the
> rest to "XML::readHTMLTable", adding columns for party and political
> office.

Yes, doing that for the political office is also possible, but the party is inside the table's caption tag, which ends up as the name of the table in the XML::readHTMLTable list ...

> However, this suppressed "<br/>" everywhere.

Why is that, please explain.

> I thought there should be an option with something like
> "XML::readHTMLTable" that would not delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK. Please, if anyone else knows, please say so *echoes in the forest*

> If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> string for each political office before passing it to
> "XML::readHTMLTable". I just tested this: It works.

Such a great hack!
IMHO, this is much more flexible than using xml2::read_html, rvest::html_table, dplyr::mutate like here[1]

> I have other web scraping problems in my work plan for the next few
> days.

Maybe, idk ...

> I will definitely try XML::htmlTreeParse, etc., as you suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
Re: [R] [External] Re: help with web scraping
Dear Rasmus et al.:

On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the first summary table like this:
>>>
>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>>
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>>   sapply(idx, function(i, dat, cn) {
>>>     x <- dat[[i]]
>>>     x[,cn[!(cn %in% colnames(x))]] <- NA
>>>     x <- x[,cn]
>>>     x$Party <- names(dat)[i]
>>>     return(list(x))
>>>   }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
>>
>> This misses something extremely important for this application: The
>> political office. That's buried in the HTML or whatever it is. I'm
>> using something like the following to find that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
>
> Dear Spencer,
>
> I came up with a solution, but it is not very elegant. Instead of
> showing you the solution, hoping you understand everything in it, I
> instead want to give you some emphatic hints to see if you can come
> up with a solution on your own.
> - XML::htmlTreeParse(M_sos)
>   - *Gandalf voice*: climb the tree until you find the content you
>     are looking for flat out at the level of «The Children of the
>     Div», *uuuUUU*
>   - you only want to keep the table and header tags at this level
> - Use XML::xmlValue to extract the values of all the headers (the
>   political positions)
> - Observe that all the tables on the page you were able to extract
>   previously using XML::readHTMLTable are at this level, shuffled
>   between the political position header tags; this means you extract
>   the political position and party affiliation by using a for loop,
>   if statements, typeof, names, and [] and [[]] to grab different
>   things from the list (content or the bag itself).
>   XML::readHTMLTable strips away the line break tags from the Mailing
>   address, so if you find a better way of extracting the tables, tell
>   me, e.g. you get
>
>     8805 HUNTER AVE
>     KANSAS CITY MO 64138
>
>   and not
>
>     8805 HUNTER AVEKANSAS CITY MO 64138
>
> When you've completed this «programming quest», you're back at the
> level of the previous email, i.e. you have the same tables, but with
> political position and party affiliation added to them.

Please excuse: Before my last post, I had written code to do all that. In brief, the political offices are "h3" tags. I used "strsplit" to split the string at "<h3". I then wrote a function to find "</h3>", extract the political office and pass the rest to "XML::readHTMLTable", adding columns for party and political office.

However, this suppressed "<br/>" everywhere. I thought there should be an option with something like "XML::readHTMLTable" that would not delete "<br/>" everywhere, but I couldn't find it. If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the string for each political office before passing it to "XML::readHTMLTable". I just tested this: It works.

I have other web scraping problems in my work plan for the next few days. I will definitely try XML::htmlTreeParse, etc., as you suggest.
Thanks again.

Spencer Graves

> Best,
> Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > Dear Spencer,
> >
> > I unified the party tables after the first summary table like this:
> >
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > M_sos <- RCurl::getURL(url)
> > saveRDS(object=M_sos, file="dcp.rds")
> > dat <- XML::readHTMLTable(M_sos)
> > idx <- 2:length(dat)
> > cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> > dat <- do.call(rbind,
> >   sapply(idx, function(i, dat, cn) {
> >     x <- dat[[i]]
> >     x[,cn[!(cn %in% colnames(x))]] <- NA
> >     x <- x[,cn]
> >     x$Party <- names(dat)[i]
> >     return(list(x))
> >   }, dat=dat, cn=cn))
> > dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
>
> This misses something extremely important for this application: The
> political office. That's buried in the HTML or whatever it is. I'm
> using something like the following to find that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not very elegant. Instead of showing you the solution, hoping you understand everything in it, I instead want to give you some emphatic hints to see if you can come up with a solution on your own.
- XML::htmlTreeParse(M_sos)
  - *Gandalf voice*: climb the tree until you find the content you are
    looking for flat out at the level of «The Children of the Div»,
    *uuuUUU*
  - you only want to keep the table and header tags at this level
- Use XML::xmlValue to extract the values of all the headers (the
  political positions)
- Observe that all the tables on the page you were able to extract
  previously using XML::readHTMLTable are at this level, shuffled
  between the political position header tags; this means you extract
  the political position and party affiliation by using a for loop, if
  statements, typeof, names, and [] and [[]] to grab different things
  from the list (content or the bag itself). XML::readHTMLTable strips
  away the line break tags from the Mailing address, so if you find a
  better way of extracting the tables, tell me, e.g. you get

    8805 HUNTER AVE
    KANSAS CITY MO 64138

  and not

    8805 HUNTER AVEKANSAS CITY MO 64138

When you've completed this «programming quest», you're back at the level of the previous email, i.e. you have the same tables, but with political position and party affiliation added to them.

Best,
Rasmus
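The hints above might be sketched roughly like this, using a small inline fragment in place of M_sos (the h3/table/div layout mirrors the description above, but the fragment, offices, and names are made up for illustration; this assumes the XML package is installed):

```r
library(XML)

# Inline stand-in for the downloaded page: office headers (h3) and
# candidate tables interleaved as children of one div.
html <- '<html><body><div>
  <h3>Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>A</td></tr></table>
  <h3>State Auditor</h3>
  <table><tr><th>Name</th></tr><tr><td>B</td></tr></table>
</div></body></html>'

doc <- htmlParse(html)
div <- getNodeSet(doc, "//div")[[1]]

office <- NA_character_
out <- list()
for (node in xmlChildren(div)) {
  if (xmlName(node) == "h3") {
    office <- xmlValue(node)       # the political position header
  } else if (xmlName(node) == "table") {
    tb <- readHTMLTable(node)      # parse just this one table
    tb$Office <- office            # stamp the current header onto it
    out[[length(out) + 1L]] <- tb
  }
}
do.call(rbind, out)
```

Because each table is parsed individually, the surrounding header is still in hand when its rows are read, which is the point of walking the div's children rather than calling readHTMLTable on the whole page.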
Re: [R] [External] Re: help with web scraping
Dear Rasmus:

On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> On 2020-07-23 17:46, William Michels wrote:
>>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
>>>>> Hello, All:
>>>>>
>>>>> I've failed with multiple attempts to scrape the table of
>>>>> candidates from the website of the Missouri Secretary of State:
>>>>>
>>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>>
>>>> Hi Spencer,
>>>>
>>>> I tried the code below on an older R-installation, and it works
>>>> fine. Not a full solution, but it's a start:
>>>>
>>>> > library(RCurl)
>>>> Loading required package: bitops
>>>> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>> > M_sos <- getURL(url)
>>>
>>> Hi Bill et al.:
>>>
>>> That broke the dam: It gave me a character vector of length 1
>>> consisting of 218 KB. I fed that to XML::readHTMLTable and
>>> purrr::map_chr, both of which returned lists of 337 data.frames.
>>> The former retained names for all the tables, absent from the
>>> latter. The columns of the former are all character; that's not
>>> true for the latter.
>>>
>>> Sadly, it's not quite what I want: It's one table for each
>>> office-party combination, but it's lost the office designation.
>>> However, I'm confident I can figure out how to hack that.
>> Maybe try something like this:
>>
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)
>
> Dear Spencer,
>
> I unified the party tables after the first summary table like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))

This is useful for this application.

> dat <- do.call(rbind,
>   sapply(idx, function(i, dat, cn) {
>     x <- dat[[i]]
>     x[,cn[!(cn %in% colnames(x))]] <- NA
>     x <- x[,cn]
>     x$Party <- names(dat)[i]
>     return(list(x))
>   }, dat=dat, cn=cn))
> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")

This misses something extremely important for this application: The political office. That's buried in the HTML or whatever it is. I'm using something like the following to find that:

str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

After I figure this out, I will use something like your code to combine it all into separate tables for each office, and then probably combine those into one table for the offices I'm interested in. For my present purposes, I don't want all the offices in Missouri, only the executive positions and those representing parts of the Kansas City metro area in the Missouri legislature.

Thanks again,

Spencer Graves

> write.table(dat, file="dcp.tsv", sep="\t",
>   row.names=FALSE, quote=TRUE, na="N/A")
>
> Best,
> Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > >
> > > Hi Spencer,
> > >
> > > I tried the code below on an older R-installation, and it works
> > > fine. Not a full solution, but it's a start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> >
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a character vector of length 1
> > consisting of 218 KB. I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter. The columns of the former are all character; that's not
> > true for the latter.
> >
> > Sadly, it's not quite what I want: It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the first summary table like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
M_sos <- RCurl::getURL(url)
saveRDS(object=M_sos, file="dcp.rds")
dat <- XML::readHTMLTable(M_sos)
idx <- 2:length(dat)
cn <- unique(unlist(lapply(dat[idx], colnames)))
dat <- do.call(rbind,
  sapply(idx, function(i, dat, cn) {
    x <- dat[[i]]
    x[,cn[!(cn %in% colnames(x))]] <- NA
    x <- x[,cn]
    x$Party <- names(dat)[i]
    return(list(x))
  }, dat=dat, cn=cn))
dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
write.table(dat, file="dcp.tsv", sep="\t",
  row.names=FALSE, quote=TRUE, na="N/A")

Best,
Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 08:20, luke-tier...@uiowa.edu wrote:
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)

Error in open.connection(x, "rb") : HTTP error 404.

Thanks for the suggestion, but this failed for me on the platform described in "sessionInfo" below.

> tbl <- rvest::html_table(h)

As I previously noted, RCurl::getURL returned a single character string of roughly 218 KB, from which I've so far gotten most but not all of what I want. Unfortunately, when I fed that character vector to rvest::html_table, I got:

Error in UseMethod("html_table") : no applicable method for 'html_table' applied to an object of class "character"

I don't know for sure yet, but I believe I'll be able to get what I want from the single character string using, e.g., gregexpr and other functions.

Thanks again,
Spencer Graves

> Best,
> luke
>
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a character vector of length 1
> > consisting of 218 KB. I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter. The columns of the former are all character; that's not
> > true for the latter.
> >
> > Sadly, it's not quite what I want: It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> >
> > On 2020-07-23 17:46, William Michels wrote:
> > > Hi Spencer,
> > >
> > > I tried the code below on an older R-installation, and it works
> > > fine.
> > > Not a full solution, but it's a start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> > > > print(M_sos)
> > > [1] "\r\n\r\n\r\n\r\n\r\n\tSOS, Missouri - Elections: Offices Filed in Candidate Filing\r\n
> > >
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > > >
> > > > I've tried base::url, base::readLines, xml2::read_html, and
> > > > XML::readHTMLTable; see summary below. Suggestions? Thanks,
> > > > Spencer Graves
> > > >
> > > > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > str(baseURL <- base::url(sosURL)) # this might give me something, but I don't know what
> > > > sosRead <- base::readLines(sosURL) # 404 Not Found
> > > > sosRb <- base::readLines(baseURL) # 404 Not Found
> > > > sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
> > > > sosXML <- XML::readHTMLTable(sosURL) # List of 0; does not seem to be XML
> > > >
> > > > sessionInfo()
> > > > R version 4.0.2 (2020-06-22)
> > > > Platform: x86_64-apple-darwin17.0 (64-bit)
> > > > Running under: macOS Catalina 10.15.5
> > > >
> > > > Matrix products: default
> > > > BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
> > > > LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
> > > >
> > > > locale:
> > > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> > > >
> > > > attached base packages:
> > > > [1] stats graphics grDevices utils datasets
> > > > [6] methods base
> > > >
> > > > loaded via a namespace (and not attached):
> > > > [1] compiler_4.0.2 tools_4.0.2 curl_4.3
> > > > [4] xml2_1.3.2 XML_3.99-0.3
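One aside on the "no applicable method for 'html_table' applied to an object of class 'character'" error quoted above: xml2::read_html() also accepts a character string of HTML, so a page already fetched with RCurl::getURL() can be parsed and handed to rvest without re-downloading. A sketch, assuming xml2 and rvest are installed (the fragment is illustrative, not from the Missouri site):

```r
library(xml2)
library(rvest)

# html_table() needs a parsed document, not a raw character string,
# but read_html() happily parses a string of HTML already in memory:
page_chars <- "<html><body><table>
  <tr><th>Name</th></tr>
  <tr><td>Raleigh Ritter</td></tr>
</table></body></html>"

tbl <- html_table(read_html(page_chars))
tbl[[1]]
```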
Maybe try something like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)

Best,
luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
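Spencer's remaining problem -- each candidate table has lost its office designation -- can be attacked by walking the parsed document in order and carrying the most recent heading along to the next table. A sketch under the (unverified) assumption that the office names sit in heading elements such as <h3>; inspect the page source and adjust the XPath to the real tag:

```r
library(RCurl)
library(xml2)

sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
doc <- read_html(getURL(sosURL))

# Select headings and tables together, in document order:
nodes  <- xml_find_all(doc, "//h3 | //table")
office <- NA_character_
tabs   <- list()
for (nd in nodes) {
    if (xml_name(nd) == "h3") {
        office <- xml_text(nd)        # remember the current office
    } else {
        tb <- rvest::html_table(nd)   # parse this one <table>
        tabs[[length(tabs) + 1L]] <- cbind(Office = office, tb)
    }
}
```

Each data.frame in tabs then carries an Office column, so the 337 anonymous tables can be filtered by race without hand-matching them back to headings.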