subject:"\[R\] Extract Data from a Webpage"

Re: [R] Extract Data from a Webpage

2008-12-16 Thread Duncan Temple Lang



Hi Chuck.

 Well, here is one way

theURL = 
"http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490";


doc = htmlParse(theURL, useInternalNodes = TRUE,
  error = function(...) {}) # discard any error messages

  # Find the nodes in the table that are of interest.
x = xpathSApply(doc, "//table//td|//table//th", xmlValue)

Now depending on the regularity of the page, we can do something like

 i = seq(1, by = 2, length = 3)
 structure(x[i + 1], names = x[i])


And we end up with a named character vector with the fields of interest.

The useInternalNodes is vital so that we can use XPath.  The XPath
language is very convenient for navigating subsets of the resulting
XML tree.

 D.


Chuck Cleland wrote:

Hi All:
  I would like to extract the provider name, address, and phone number
from multiple webpages like this:

http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490

  Based on searching R-help archives, it seems like the XML package
might have something useful for this task.  I can load the XML package
and supply the url as an argument to htmlTreeParse(), but I don't know
how to go from there.

thanks,

Chuck Cleland


sessionInfo()

R version 2.8.0 Patched (2008-12-04 r47066)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] XML_1.98-1



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] Extract Data from a Webpage

2008-12-16 Thread Chuck Cleland

Hi All:
  I would like to extract the provider name, address, and phone number
from multiple webpages like this:

http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490

  Based on searching R-help archives, it seems like the XML package
might have something useful for this task.  I can load the XML package
and supply the url as an argument to htmlTreeParse(), but I don't know
how to go from there.

thanks,

Chuck Cleland

> sessionInfo()
R version 2.8.0 Patched (2008-12-04 r47066)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] XML_1.98-1

-- 
Chuck Cleland, Ph.D.
NDRI, Inc. (www.ndri.org)
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 512-0171 (M, W, F)
fax: (917) 438-0894

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Extract Data from a Webpage

[R] Extract Data from a Webpage

2 matches

Site Navigation

Mail list logo

Footer information