Re: [R] [External] Re: help with web scraping

Spencer Graves Sat, 25 Jul 2020 10:44:20 -0700

Dear Rasmus Liland et al.:


On 2020-07-25 11:30, Rasmus Liland wrote:

On 2020-07-25 09:56 -0500, Spencer Graves wrote:

Dear Rasmus et al.:


It is LILAND et al., is it not?  ... else it's customary to
put a comma in there, isn't it? ...



The APA Style recommends "Sharp et al., 2007":


https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html


          Regarding Confucius, I'm confused.

right, moving on:

On 2020-07-25 04:10, Rasmus Liland wrote:


<snip>


Please research using Thunderbird, Claws
mail, or some other sane e-mail client;
they are great, I promise.

Thanks. I researched it and turned off HTML. Please excuse: I noticed it was a problem, but hadn't prioritized time to research and fix it until your comment. Thanks.

Please excuse: Before my last post, I
had written code to do all that.


Good!

In brief, the political offices are
"h3" tags.


Yes, some type of header element at
least, in between the various tables,
all of them children of the div in the
element tree.

I used "strsplit" to split the string
at "<h3>". I then wrote a
function to find "</h3>", extract the
political office and pass the rest to
"XML::readHTMLTable", adding columns
for party and political office.
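
For readers following along, the splitting step described above might look roughly like the sketch below. It uses base R only. The HTML fragment, the office and party names, and the helper split_office() are all made up for illustration; the actual page markup and Spencer's actual function are not shown in this thread.

```r
# Hypothetical page fragment; the real page's markup is not shown in this thread.
html <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table><caption>Party A</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield</td></tr></table>",
  "<h3>Secretary of State</h3>",
  "<table><caption>Party B</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>John Roe</td><td>2 Oak Ave<br/>Columbia</td></tr></table>",
  "</div>"
)

# Split at each opening <h3>; drop the first piece, which precedes the first office.
pieces <- strsplit(html, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the markup that follows it.
split_office <- function(piece) {
  end <- regexpr("</h3>", piece, fixed = TRUE)[1]
  list(
    office = substr(piece, 1, end - 1),
    rest   = substr(piece, end + nchar("</h3>"), nchar(piece))
  )
}

sections <- lapply(pieces, split_office)
offices  <- vapply(sections, `[[`, character(1), "office")
```

Each element of "sections" then holds one office name plus the table markup for that office, which is the part that would be handed on to a table parser.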


Yes, doing that for the political office
is also possible, but the party is
inside the table's caption tag, which
ends up as the name of the table in the
XML::readHTMLTable list ...

However, this suppressed "<br/>"
everywhere.


Why is that? Please explain.

I don't know why the Missouri Secretary of State's web site includes "<br/>" to signal a new line, but it does. I also don't know why XML::readHTMLTable suppressed "<br/>" everywhere it occurred, but it did that. After I used gsub to replace "<br/>" with "\n", I found that XML::readHTMLTable did not remove "\n", so I got what I wanted.

I thought there should be
an option with something like
"XML::readHTMLTable" that would not
delete "<br/>" everywhere, but I
couldn't find it.


No, there is not, AFAIK.  Please, if
anyone else knows, please say so *echoes
in the forest*

If you aren't aware of one, I can
gsub("<br/>", "\n", ...) on the string
for each political office before
passing it to "XML::readHTMLTable". I
just tested this: It works.
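
A minimal sketch of that gsub step, assuming the XML package is installed. The table fragment and the names "tbl", "tbl_nl" and "addr" are invented for illustration and are not taken from the Missouri page.

```r
library(XML)

# Hypothetical table fragment with a <br/> inside one cell.
tbl <- paste0(
  "<table>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield</td></tr>",
  "</table>"
)

# Replace each literal <br/> with a newline before parsing.
tbl_nl <- gsub("<br/>", "\n", tbl, fixed = TRUE)

# Parse the string as HTML and read every table it contains.
doc  <- htmlParse(tbl_nl, asText = TRUE)
tabs <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Address cell of the first (and only) table.
addr <- tabs[[1]]$Address
```

If the page writes the break as "<br>" or "<br />", the pattern would need adjusting; only the "<br/>" form is mentioned in this thread.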


Such a great hack!  IMHO, this is much
more flexible than using
xml2::read_html, rvest::html_table,
dplyr::mutate like here[1]

I have other web scraping problems in
my work plan for the next few days.


Maybe, idk ...

I will definitely try
XML::htmlTreeParse, etc., as you
suggest.


I wish you good luck,
Rasmus

[1] 
https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells



          And I added my solution to this problem to this Stack Overflow thread.


          Thanks again,
          Spencer



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

