Whoops, it seems I could use some help with regular expressions...
Consider the following two functions: the first creates a search string,
and the second retrieves the content from the URL.
makeURLsearch <- function(key, dates=c(NULL, NULL)){
  # pad to length 2 and blank out missing years, so that c(1980, NULL)
  # (which R collapses to c(1980)) does not produce "as_yhi=NA"
  dates <- c(dates, NA, NA)[1:2]
  dates[is.na(dates)] <- ""
  base.search <- "http://scholar.google.co.uk/scholar?"
  key.search <- paste("as_q=", key, "&", sep="")
  other.search <- "num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&"
  dates.search <- paste("as_ylo=", dates[1], "&as_yhi=", dates[2],
                        "&as_allsubj=all&hl=en&lr=", sep="")
  full.search <- paste(base.search, key.search, other.search,
                       dates.search, sep="")
  return(full.search)
}
makeURLsearch("plasmonics")
makeURLsearch("photonics", c(1980, NULL))
retrieveNumberPublications <- function(url){
  x <- readLines(url)
  y <- grep('of about', x, value=TRUE)
  z <- gsub('of about\\s+</b>', '\\1', y[1], perl=TRUE) # this does not do what I wanted
  # the bit to retrieve is the number below
  # <b>10</b> of about <b>21,900</b> for <b><b>photonics</b>
  z
}
retrieveNumberPublications( makeURLsearch("photonics", c(2008, NULL)) )
I can isolate the long string containing the result I want, but I cannot
single out the value itself, i.e. the 21,900 in
" <b>10</b> of about <b>21,900</b> for <b><b>photonics</b> ".
Any regexp guru to help me out? I've never got my head around these,
other than trivial cases.
Many thanks,
baptiste
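
For reference, a minimal sketch of one way the number could be pulled out, assuming the result line always has the form shown above ("of about <b>21,900</b>"). The pattern in the post has no capture group, so '\\1' has nothing to refer to, and it expects "</b>" right after "of about" where the page actually has "<b>". The helper name extractCount is hypothetical and not part of the original code.

```r
# Hypothetical helper: extract the hit count from a single result line.
# Assumes the count is the first thing in a <b>...</b> pair after "of about".
extractCount <- function(line) {
  # capture the digits and thousands separators between <b> and </b>
  pattern <- ".*of about\\s+<b>([0-9,]+)</b>.*"
  # sub() returns its input unchanged when nothing matches, so check first
  if (!grepl(pattern, line)) return(NA_real_)
  digits <- sub(pattern, "\\1", line)
  # drop the commas before converting to a number
  as.numeric(gsub(",", "", digits, fixed = TRUE))
}

extractCount("<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>")
```

Inside retrieveNumberPublications this would replace the gsub() line, e.g. z <- extractCount(y[1]). It returns NA when the line is missing or has a different layout, which seems safer than returning the unmodified HTML.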
On 15 Jan 2009, at 09:45, baptiste auguie wrote:
For the record, I thought I'd share two findings:
First, the Web of Science website does seem to have some sort of API,
as discussed here:
http://scientific.thomson.com/support/faq/webservices/
It does not seem like a trivial thing to set up though.
Second, because I could not easily pass the search term in the
address, I looked into Google Scholar instead, where a typical search
looks like:
http://scholar.google.co.uk/scholar?as_q=plasmonics&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=1960&as_allsubj=all&hl=en&lr=
Here it is trivial to create such a string with the desired keyword
and dates, and to retrieve the number of results using readLines(url) and
grep.
Thanks to Phil Spector for some pointers.
Best wishes,
baptiste
_____________________________
Baptiste Auguié
School of Physics
University of Exeter
Stocker Road,
Exeter, Devon,
EX4 4QL, UK
Phone: +44 1392 264187
http://newton.ex.ac.uk/research/emag