Re: [R] [External] Re: help with web scraping
Dear William Michels,

On 2020-07-25 10:58 -0700, William Michels wrote:
> Dear Spencer Graves (and Rasmus Liland),
>
> I've had some luck just using gsub() to alter the offending "<br/>"
> characters, appending a "___" tag at each instance of "<br/>" (first
> I checked the text to make sure it didn't contain any pre-existing
> instances of "___"). See the output snippet below:
>
> > library(RCurl)
> > library(XML)
> > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > sosChars <- getURL(sosURL)
> > sosChars2 <- gsub("<br/>", "___", sosChars)
> > MOcan <- readHTMLTable(sosChars2)
> > MOcan[[2]]
>                   Name                          Mailing Address Random Number Date Filed
> 1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
> 2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
> 3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
> 4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>
> It's true, there's a 'section' of MOcan output that contains
> odd-looking characters (see the "Total" line of MOcan[[1]]). But my
> guess is you'll be deleting this 'line' anyway--and recalculating
> totals in R.

Perhaps this is the table you mean?

                Offices Republican Democratic Libertarian    Green Constitution      Total
1              Governor          4          5           1        1            0         11
2   Lieutenant Governor          4          2           1        1            0          8
3    Secretary of State          1          1           1        1            1          5
4       State Treasurer          1          1           1        1            0          4
5      Attorney General          1          2           1        0            0          4
6   U.S. Representative         24         16           9        0            0         49
7         State Senator         28         22           2        1            0         53
8  State Representative        187        137           6        2            1        333
9         Circuit Judge         18          1           0        0            0         19
10                Total 268\r\n___ 187\r\n___   22\r\n___ 7\r\n___     2\r\n___ 486\r\n___

Yes, somehow the Windows[1] character "0xD" gets converted to "\r\n" after your gsub, while "<br/>" is still ignored. There is no "0xD" inside the td.AddressCol cells in the tables we are interested in.
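The "\r\n___" residue in the "Total" row above can be stripped with one more base-R pass before recalculating totals; a minimal sketch (the cell value is copied from the output above, everything else is illustrative):

```r
# A 'Total' cell as returned after the "___" substitution:
total_cell <- "268\r\n___"

# Strip the carriage-return/newline pair and the "___" marker,
# then trim any remaining whitespace:
clean <- trimws(gsub("\r\n|___", "", total_cell))

clean              # "268"
as.numeric(clean)  # 268
```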
> Now that you have a comprehensive list object, you should be able to
> pull out districts/races of interest. You might want to take a look
> at the "rlist" package, to see if it can make your work a little
> easier:
>
> https://CRAN.R-project.org/package=rlist
> https://renkun-ken.github.io/rlist-tutorial/index.html

Thank you, this package seems useful. Can you please provide a hint (maybe) as to which of the many functions you were thinking of? E.g. instead of using a for loop over the index of the list of headers and tables, checking if typeof is list or character, and updating variables to write the political position into each table.

V r

[1] https://stackoverflow.com/questions/5843495/what-does-m-character-mean-in-vim

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
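One possible shape for that for/typeof loop, sketched in base R with a toy list standing in for the parsed page (the offices and names here are made up, not from the Missouri site; in rlist terms, list.filter() and list.rbind() look like the closest candidates, though that is only a guess):

```r
# Toy stand-in for the parsed page: office headers (character) and
# candidate tables (data frames), interleaved in one list.
page <- list(
  "Governor",
  data.frame(Name = c("A", "B")),
  "State Auditor",
  data.frame(Name = "C")
)

# Walk the list once, carrying the most recent header forward and
# stamping it onto each table that follows it.
office <- NA_character_
tables <- list()
for (x in page) {
  if (is.character(x)) {
    office <- x                          # remember the current office
  } else if (is.data.frame(x)) {
    x$Office <- office                   # stamp it onto the table
    tables[[length(tables) + 1L]] <- x
  }
}
combined <- do.call(rbind, tables)
combined$Office  # "Governor" "Governor" "State Auditor"
```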
Re: [R] [External] Re: help with web scraping
Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:
> Dear Rasmus Liland et al.:
>
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> >
> > It is LILAND et al., is it not? ... else it's customary to put a
> > comma in there, isn't it? ...
>
> The APA Style recommends "Sharp et al., 2007":
>
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

If "Sharp et al., 2007" is an APA citation of this book[*], Sharp is John A Sharp's surname, and Liland is my surname. Q.E.D. I have not used APA before (as I am not a psychiatrist), as the minimalism of IEEE[**] always seemed more desirable.

> Regarding Confucius, I'm confused.

Never mind, just fooling around, that's all.

> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > >
> > > However, this suppressed "<br/>" everywhere.
> >
> > Why is that, please explain.
>
> I don't know why the Missouri Secretary of State's web site includes
> "<br/>" to signal a new line, but it does.

Me neither! On top of that, "<br/>" is actually[***] an XHTML tag, not an HTML tag.

> I also don't know why XML::readHTMLTable suppressed "<br/>"
> everywhere it occurred, but it did that.

Yes, I know, I also observed this. But now we swiftly solved this by gsubbing it with the newline char, "\n", which does not make sense to HTML parsers anyway.

> > > If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> > > string for each political office before passing it to
> > > "XML::readHTMLTable". I just tested this: It works.
> >
> > Such a great hack! IMHO, this is much more flexible than using
> > xml2::read_html, rvest::html_table, dplyr::mutate like here[1]
> >
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
>
> And I added my solution to this problem to this Stackoverflow thread.
I wish you many upvotes; alas, the political competition is obviously not tough there, as the other guy just got one downvote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br
Re: [R] [External] Re: help with web scraping
Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>" characters, appending a "___" tag at each instance of "<br/>" (first I checked the text to make sure it didn't contain any pre-existing instances of "___"). See the output snippet below:

> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                          Mailing Address Random Number Date Filed
1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020

It's true, there's a 'section' of MOcan output that contains odd-looking characters (see the "Total" line of MOcan[[1]]). But my guess is you'll be deleting this 'line' anyway--and recalculating totals in R.

Now that you have a comprehensive list object, you should be able to pull out districts/races of interest. You might want to take a look at the "rlist" package, to see if it can make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html

HTH, Bill.

W. Michels, Ph.D.
On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves wrote:
>
> Dear Rasmus et al.:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the first summary table like this:
> >>>
> >>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>> M_sos <- RCurl::getURL(url)
> >>> saveRDS(object=M_sos, file="dcp.rds")
> >>> dat <- XML::readHTMLTable(M_sos)
> >>> idx <- 2:length(dat)
> >>> cn <- unique(unlist(lapply(dat[idx], colnames)))
> >>
> >> This is useful for this application.
> >>
> >>> dat <- do.call(rbind,
> >>>   sapply(idx, function(i, dat, cn) {
> >>>     x <- dat[[i]]
> >>>     x[,cn[!(cn %in% colnames(x))]] <- NA
> >>>     x <- x[,cn]
> >>>     x$Party <- names(dat)[i]
> >>>     return(list(x))
> >>>   }, dat=dat, cn=cn))
> >>> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
> >>
> >> This misses something extremely important for this application: The
> >> political office. That's buried in the HTML or whatever it is. I'm
> >> using something like the following to find that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> >
> > Dear Spencer,
> >
> > I came up with a solution, but it is not very elegant. Instead of
> > showing you the solution, hoping you understand everything in it, I
> > instead want to give you some emphatic hints to see if you can come
> > up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> >   - *Gandalf voice*: climb the tree until you find the content you
> >     are looking for flat out at the level of «The Children of the
> >     Div», *uuuUUU*
> >   - you only want to keep the table and header tags at this level
> > - Use XML::xmlValue to extract the values of all the headers (the
> >   political positions)
> > - Observe that all the tables on the page you were able to extract
> >   previously using XML::readHTMLTable are at this level, shuffled
> >   between the political position header tags; this means you extract
> >   the political position and party affiliation by using a for loop,
> >   if statements, typeof, names, and [] and [[]] to grab different
> >   things from the list (content or the bag itself).
> >   XML::readHTMLTable strips away the line break tags from the
> >   Mailing address, so if you find a better way of extracting the
> >   tables, tell me, e.g. you get
> >
> >     8805 HUNTER AVE
> >     KANSAS CITY MO 64138
> >
> >   and not
> >
> >     8805 HUNTER AVEKANSAS CITY MO 64138
> >
> > When you've completed this «programming quest», you're back at the
> > level of the previous email, i.e. you have the same tables, but
> > with political position and party affiliation added to them.
>
> Please excuse: Before my last post, I had written code to do all
> that. In brief, the political offices are "h3" tags. I used
> "strsplit" to split the string at "<h3". I then wrote a function to
> find "</h3>", extract the political office and pass the rest to
> "XML::readHTMLTable", adding columns for party and political office.
>
> However, this suppressed "<br/>" everywhere. I thought there should
> be an
Re: [R] [External] Re: help with web scraping
Dear Rasmus Liland et al.:

On 2020-07-25 11:30, Rasmus Liland wrote:
> On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > Dear Rasmus et al.:
>
> It is LILAND et al., is it not? ... else it's customary to put a
> comma in there, isn't it? ...

The APA Style recommends "Sharp et al., 2007":

https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

Regarding Confucius, I'm confused.

> right, moving on:
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
>
> Please research using Thunderbird, Claws mail, or some other sane
> e-mail client; they are great, I promise.

Thanks. I researched it and turned off HTML. Please excuse: I noticed it was a problem, but hadn't prioritized time to research and fix it until your comment. Thanks.

> > Please excuse: Before my last post, I had written code to do all
> > that.
>
> Good!
>
> > In brief, the political offices are "h3" tags.
>
> Yes, some type of header element at least, in-between the various
> tables, everything children of the div in the element tree.
>
> > I used "strsplit" to split the string at "<h3". I then wrote a
> > function to find "</h3>", extract the political office and pass the
> > rest to "XML::readHTMLTable", adding columns for party and
> > political office.
>
> Yes, doing that for the political office is also possible, but the
> party is inside the table's caption tag, which ends up as the name of
> the table in the XML::readHTMLTable list ...
>
> > However, this suppressed "<br/>" everywhere.
>
> Why is that, please explain.

I don't know why the Missouri Secretary of State's web site includes "<br/>" to signal a new line, but it does. I also don't know why XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did that. After I used gsub to replace "<br/>" with "\n", I found that XML::readHTMLTable did not replace "\n", so I got what I wanted.

> > I thought there should be an option with something like
> > "XML::readHTMLTable" that would not delete "<br/>" everywhere, but
> > I couldn't find it.
>
> No, there is not, AFAIK.
> Please, if anyone else knows, please say so *echoes in the forest*
>
> > If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> > string for each political office before passing it to
> > "XML::readHTMLTable". I just tested this: It works.
>
> Such a great hack! IMHO, this is much more flexible than using
> xml2::read_html, rvest::html_table, dplyr::mutate like here[1]
>
> > I have other web scraping problems in my work plan for the next few
> > days.
>
> Maybe, idk ...
>
> > I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
> I wish you good luck,
> Rasmus
>
> [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells

And I added my solution to this problem to this Stackoverflow thread.

Thanks again,
Spencer
Re: [R] [External] Re: help with web scraping
On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not? I do not belong to a large Confucian family structure (putting the hunter-gatherer horse-rider tribe name first in all-caps in the email), else it's customary to put a comma in there, isn't it? ... right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:
> > ?

It might be a better idea to write the reply in plain-text utf-8 or at least Western or Eastern-European ISO euro encoding instead of us-ascii (maybe KOI8, ¯\_(ツ)_/¯) ... something in your email got string-replaced by "?" and also "«" got replaced by "?". Please research using Thunderbird, Claws mail, or some other sane e-mail client; they are great, I promise.

> Please excuse: Before my last post, I had written code to do all
> that.

Good!

> In brief, the political offices are "h3" tags.

Yes, some type of header element at least, in-between the various tables, everything children of the div in the element tree.

> I used "strsplit" to split the string at "<h3". I then wrote a
> function to find "</h3>", extract the political office and pass the
> rest to "XML::readHTMLTable", adding columns for party and political
> office.

Yes, doing that for the political office is also possible, but the party is inside the table's caption tag, which ends up as the name of the table in the XML::readHTMLTable list ...

> However, this suppressed "<br/>" everywhere.

Why is that, please explain.

> I thought there should be an option with something like
> "XML::readHTMLTable" that would not delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK. Please, if anyone else knows, please say so *echoes in the forest*

> If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the
> string for each political office before passing it to
> "XML::readHTMLTable". I just tested this: It works.

Such a great hack!
IMHO, this is much more flexible than using xml2::read_html, rvest::html_table, dplyr::mutate like here[1]

> I have other web scraping problems in my work plan for the next few
> days.

Maybe, idk ...

> I will definitely try XML::htmlTreeParse, etc., as you suggest.

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
Re: [R] [External] Re: help with web scraping
Dear Rasmus et al.:

On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the first summary table like this:
>>>
>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>>
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>>   sapply(idx, function(i, dat, cn) {
>>>     x <- dat[[i]]
>>>     x[,cn[!(cn %in% colnames(x))]] <- NA
>>>     x <- x[,cn]
>>>     x$Party <- names(dat)[i]
>>>     return(list(x))
>>>   }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
>>
>> This misses something extremely important for this application: The
>> political office. That's buried in the HTML or whatever it is. I'm
>> using something like the following to find that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
>
> Dear Spencer,
>
> I came up with a solution, but it is not very elegant. Instead of
> showing you the solution, hoping you understand everything in it, I
> instead want to give you some emphatic hints to see if you can come
> up with a solution on your own.
> - XML::htmlTreeParse(M_sos)
>   - *Gandalf voice*: climb the tree until you find the content you
>     are looking for flat out at the level of «The Children of the
>     Div», *uuuUUU*
>   - you only want to keep the table and header tags at this level
> - Use XML::xmlValue to extract the values of all the headers (the
>   political positions)
> - Observe that all the tables on the page you were able to extract
>   previously using XML::readHTMLTable are at this level, shuffled
>   between the political position header tags; this means you extract
>   the political position and party affiliation by using a for loop,
>   if statements, typeof, names, and [] and [[]] to grab different
>   things from the list (content or the bag itself).
>   XML::readHTMLTable strips away the line break tags from the Mailing
>   address, so if you find a better way of extracting the tables, tell
>   me, e.g. you get
>
>     8805 HUNTER AVE
>     KANSAS CITY MO 64138
>
>   and not
>
>     8805 HUNTER AVEKANSAS CITY MO 64138
>
> When you've completed this «programming quest», you're back at the
> level of the previous email, i.e. you have the same tables, but with
> political position and party affiliation added to them.

Please excuse: Before my last post, I had written code to do all that. In brief, the political offices are "h3" tags. I used "strsplit" to split the string at "<h3". I then wrote a function to find "</h3>", extract the political office and pass the rest to "XML::readHTMLTable", adding columns for party and political office.

However, this suppressed "<br/>" everywhere. I thought there should be an option with something like "XML::readHTMLTable" that would not delete "<br/>" everywhere, but I couldn't find it. If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the string for each political office before passing it to "XML::readHTMLTable". I just tested this: It works.

I have other web scraping problems in my work plan for the next few days. I will definitely try XML::htmlTreeParse, etc., as you suggest.
Thanks again.

Spencer Graves

> Best,
> Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> Dear Rasmus:
>
> > Dear Spencer,
> >
> > I unified the party tables after the first summary table like this:
> >
> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > M_sos <- RCurl::getURL(url)
> > saveRDS(object=M_sos, file="dcp.rds")
> > dat <- XML::readHTMLTable(M_sos)
> > idx <- 2:length(dat)
> > cn <- unique(unlist(lapply(dat[idx], colnames)))
>
> This is useful for this application.
>
> > dat <- do.call(rbind,
> >   sapply(idx, function(i, dat, cn) {
> >     x <- dat[[i]]
> >     x[,cn[!(cn %in% colnames(x))]] <- NA
> >     x <- x[,cn]
> >     x$Party <- names(dat)[i]
> >     return(list(x))
> >   }, dat=dat, cn=cn))
> > dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
>
> This misses something extremely important for this application: The
> political office. That's buried in the HTML or whatever it is. I'm
> using something like the following to find that:
>
> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

Dear Spencer,

I came up with a solution, but it is not very elegant. Instead of showing you the solution, hoping you understand everything in it, I instead want to give you some emphatic hints to see if you can come up with a solution on your own.
- XML::htmlTreeParse(M_sos)
  - *Gandalf voice*: climb the tree until you find the content you are
    looking for flat out at the level of «The Children of the Div»,
    *uuuUUU*
  - you only want to keep the table and header tags at this level
- Use XML::xmlValue to extract the values of all the headers (the
  political positions)
- Observe that all the tables on the page you were able to extract
  previously using XML::readHTMLTable are at this level, shuffled
  between the political position header tags; this means you extract
  the political position and party affiliation by using a for loop, if
  statements, typeof, names, and [] and [[]] to grab different things
  from the list (content or the bag itself). XML::readHTMLTable strips
  away the line break tags from the Mailing address, so if you find a
  better way of extracting the tables, tell me, e.g. you get

    8805 HUNTER AVE
    KANSAS CITY MO 64138

  and not

    8805 HUNTER AVEKANSAS CITY MO 64138

When you've completed this «programming quest», you're back at the level of the previous email, i.e. you have the same tables, but with political position and party affiliation added to them.

Best,
Rasmus
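The hints above might be sketched roughly like this, using a small inline fragment in place of M_sos (the h3/table/div layout mirrors the description above, but the fragment, offices, and names are made up for illustration; this assumes the XML package is installed):

```r
library(XML)

# Inline stand-in for the downloaded page: office headers (h3) and
# candidate tables interleaved as children of one div.
html <- '<html><body><div>
  <h3>Governor</h3>
  <table><tr><th>Name</th></tr><tr><td>A</td></tr></table>
  <h3>State Auditor</h3>
  <table><tr><th>Name</th></tr><tr><td>B</td></tr></table>
</div></body></html>'

doc <- htmlParse(html)
div <- getNodeSet(doc, "//div")[[1]]

office <- NA_character_
out <- list()
for (node in xmlChildren(div)) {
  if (xmlName(node) == "h3") {
    office <- xmlValue(node)       # the political position header
  } else if (xmlName(node) == "table") {
    tb <- readHTMLTable(node)      # parse just this one table
    tb$Office <- office            # stamp the current header onto it
    out[[length(out) + 1L]] <- tb
  }
}
do.call(rbind, out)
```

Because each table is parsed individually, the surrounding header is still in hand when its rows are read, which is the point of walking the div's children rather than calling readHTMLTable on the whole page.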
Re: [R] [External] Re: help with web scraping
Dear Rasmus:

On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> On 2020-07-23 17:46, William Michels wrote:
>>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
>>>>> Hello, All:
>>>>>
>>>>> I've failed with multiple attempts to scrape the table of
>>>>> candidates from the website of the Missouri Secretary of State:
>>>>>
>>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>>
>>>> Hi Spencer,
>>>>
>>>> I tried the code below on an older R-installation, and it works
>>>> fine. Not a full solution, but it's a start:
>>>>
>>>> > library(RCurl)
>>>> Loading required package: bitops
>>>> > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>> > M_sos <- getURL(url)
>>>
>>> Hi Bill et al.:
>>>
>>> That broke the dam: It gave me a character vector of length 1
>>> consisting of 218 KB. I fed that to XML::readHTMLTable and
>>> purrr::map_chr, both of which returned lists of 337 data.frames.
>>> The former retained names for all the tables, absent from the
>>> latter. The columns of the former are all character; that's not
>>> true for the latter.
>>>
>>> Sadly, it's not quite what I want: It's one table for each
>>> office-party combination, but it's lost the office designation.
>>> However, I'm confident I can figure out how to hack that.
>> Maybe try something like this:
>>
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)
>
> Dear Spencer,
>
> I unified the party tables after the first summary table like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))

This is useful for this application.

> dat <- do.call(rbind,
>   sapply(idx, function(i, dat, cn) {
>     x <- dat[[i]]
>     x[,cn[!(cn %in% colnames(x))]] <- NA
>     x <- x[,cn]
>     x$Party <- names(dat)[i]
>     return(list(x))
>   }, dat=dat, cn=cn))
> dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")

This misses something extremely important for this application: The political office. That's buried in the HTML or whatever it is. I'm using something like the following to find that:

str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

After I figure this out, I will use something like your code to combine it all into separate tables for each office, and then probably combine those into one table for the offices I'm interested in. For my present purposes, I don't want all the offices in Missouri, only the executive positions and those representing parts of the Kansas City metro area in the Missouri legislature.

Thanks again,

Spencer Graves

> write.table(dat, file="dcp.tsv", sep="\t",
>   row.names=FALSE, quote=TRUE, na="N/A")
>
> Best,
> Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > >
> > > Hi Spencer,
> > >
> > > I tried the code below on an older R-installation, and it works
> > > fine. Not a full solution, but it's a start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> >
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a character vector of length 1
> > consisting of 218 KB. I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter. The columns of the former are all character; that's not
> > true for the latter.
> >
> > Sadly, it's not quite what I want: It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I unified the party tables after the first summary table like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
M_sos <- RCurl::getURL(url)
saveRDS(object=M_sos, file="dcp.rds")
dat <- XML::readHTMLTable(M_sos)
idx <- 2:length(dat)
cn <- unique(unlist(lapply(dat[idx], colnames)))
dat <- do.call(rbind,
  sapply(idx, function(i, dat, cn) {
    x <- dat[[i]]
    x[,cn[!(cn %in% colnames(x))]] <- NA
    x <- x[,cn]
    x$Party <- names(dat)[i]
    return(list(x))
  }, dat=dat, cn=cn))
dat[,"Date Filed"] <- as.Date(x=dat[,"Date Filed"], format="%m/%d/%Y")
write.table(dat, file="dcp.tsv", sep="\t",
  row.names=FALSE, quote=TRUE, na="N/A")

Best,
Rasmus
Re: [R] [External] Re: help with web scraping
On 2020-07-24 08:20, luke-tier...@uiowa.edu wrote:
> Maybe try something like this:
>
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)

Error in open.connection(x, "rb") : HTTP error 404.

Thanks for the suggestion, but this failed for me on the platform described in "sessionInfo" below.

> tbl <- rvest::html_table(h)

As I previously noted, RCurl::getURL returned a single character string of roughly 218 KB, from which I've so far gotten most but not all of what I want. Unfortunately, when I fed that character vector to rvest::html_table, I got:

Error in UseMethod("html_table") : no applicable method for 'html_table' applied to an object of class "character"

I don't know for sure yet, but I believe I'll be able to get what I want from the single character string using, e.g., gregexpr and other functions.

Thanks again,
Spencer Graves

> Best,
> luke
>
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > Hi Bill et al.:
> >
> > That broke the dam: It gave me a character vector of length 1
> > consisting of 218 KB. I fed that to XML::readHTMLTable and
> > purrr::map_chr, both of which returned lists of 337 data.frames.
> > The former retained names for all the tables, absent from the
> > latter. The columns of the former are all character; that's not
> > true for the latter.
> >
> > Sadly, it's not quite what I want: It's one table for each
> > office-party combination, but it's lost the office designation.
> > However, I'm confident I can figure out how to hack that.
> >
> > On 2020-07-23 17:46, William Michels wrote:
> > > Hi Spencer,
> > >
> > > I tried the code below on an older R-installation, and it works
> > > fine.
> > > Not a full solution, but it's a start:
> > >
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> > > > print(M_sos)
> > > [1] "\r\n\r\n\r\n\r\n\r\n\tSOS, Missouri - Elections: Offices Filed in Candidate Filing\r\n
> > >
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves wrote:
> > > > Hello, All:
> > > >
> > > > I've failed with multiple attempts to scrape the table of
> > > > candidates from the website of the Missouri Secretary of State:
> > > >
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > > >
> > > > I've tried base::url, base::readLines, xml2::read_html, and
> > > > XML::readHTMLTable; see summary below. Suggestions? Thanks,
> > > > Spencer Graves
> > > >
> > > > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > str(baseURL <- base::url(sosURL)) # this might give me something, but I don't know what
> > > > sosRead <- base::readLines(sosURL) # 404 Not Found
> > > > sosRb <- base::readLines(baseURL) # 404 Not Found
> > > > sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
> > > > sosXML <- XML::readHTMLTable(sosURL) # List of 0; does not seem to be XML
> > > >
> > > > sessionInfo()
> > > > R version 4.0.2 (2020-06-22)
> > > > Platform: x86_64-apple-darwin17.0 (64-bit)
> > > > Running under: macOS Catalina 10.15.5
> > > >
> > > > Matrix products: default
> > > > BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
> > > > LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
> > > >
> > > > locale:
> > > > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> > > >
> > > > attached base packages:
> > > > [1] stats graphics grDevices utils datasets
> > > > [6] methods base
> > > >
> > > > loaded via a namespace (and not attached):
> > > > [1] compiler_4.0.2 tools_4.0.2 curl_4.3
> > > > [4] xml2_1.3.2 XML_3.99-0.3
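One aside on the "no applicable method for 'html_table' applied to an object of class 'character'" error quoted above: xml2::read_html() also accepts a character string of HTML, so a page already fetched with RCurl::getURL() can be parsed and handed to rvest without re-downloading. A sketch, assuming xml2 and rvest are installed (the fragment is illustrative, not from the Missouri site):

```r
library(xml2)
library(rvest)

# html_table() needs a parsed document, not a raw character string,
# but read_html() happily parses a string of HTML already in memory:
page_chars <- "<html><body><table>
  <tr><th>Name</th></tr>
  <tr><td>Raleigh Ritter</td></tr>
</table></body></html>"

tbl <- html_table(read_html(page_chars))
tbl[[1]]
```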
Maybe try something like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)

Best,
luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
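Spencer's remaining problem -- each candidate table has lost its office designation -- can be attacked by walking the parsed document in order and carrying the most recent heading along to the next table. A sketch under the (unverified) assumption that the office names sit in heading elements such as <h3>; inspect the page source and adjust the XPath to the real tag:

```r
library(RCurl)
library(xml2)

sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
doc <- read_html(getURL(sosURL))

# Select headings and tables together, in document order:
nodes  <- xml_find_all(doc, "//h3 | //table")
office <- NA_character_
tabs   <- list()
for (nd in nodes) {
    if (xml_name(nd) == "h3") {
        office <- xml_text(nd)        # remember the current office
    } else {
        tb <- rvest::html_table(nd)   # parse this one <table>
        tabs[[length(tabs) + 1L]] <- cbind(Office = office, tb)
    }
}
```

Each data.frame in tabs then carries an Office column, so the 337 anonymous tables can be filtered by race without hand-matching them back to headings.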