Whoops, it seems I could use some help with regular expressions...

Consider the following two functions: the first builds a search string, and the second retrieves the content from the URL.

makeURLsearch <- function(key, dates=c(NULL, NULL)){
        
        base.search <- "http://scholar.google.co.uk/scholar?"
        key.search <- paste("as_q=", key, "&", sep="")
        other.search <- "num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&"
        dates.search <- paste("as_ylo=", dates[1], "&as_yhi=", dates[2], "&as_allsubj=all&hl=en&lr=", sep="")
        
        full.search <- paste(base.search, key.search, other.search, dates.search, sep="")
        return(full.search)
}


makeURLsearch("plasmonics")
makeURLsearch("photonics", c(1980, NULL))
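A side note on the second call: c(1980, NULL) is just c(1980), so dates[2] is NA and paste() writes the literal text "NA" into the as_yhi field. Below is a minimal sketch of one way around this, assuming a missing bound should simply be left empty in the URL. The names makeDatesSearch and blank.if.na are made up for this illustration and are not used elsewhere in this message.

```r
# Hypothetical helper: build only the date part of the query string.
# 'from' and 'to' are separate arguments; NA means "no bound".
makeDatesSearch <- function(from = NA, to = NA) {
  # Turn a missing bound into an empty string rather than the text "NA"
  blank.if.na <- function(x) if (is.na(x)) "" else as.character(x)
  paste("as_ylo=", blank.if.na(from),
        "&as_yhi=", blank.if.na(to),
        "&as_allsubj=all&hl=en&lr=", sep = "")
}

makeDatesSearch(1980)        # lower bound only
makeDatesSearch(1980, 2008)  # both bounds
```

The result could be pasted in place of dates.search inside makeURLsearch.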

retrieveNumberPublications <- function(url){
        
        x <- readLines(url)
        y <- grep('of about',x, value=TRUE)
        z <- gsub('of about\\s+</b>', '\\1', y[1], perl=TRUE) # this does not do what I wanted

        # the bit to retrieve is the number below
        #  <b>10</b> of about <b>21,900</b> for <b><b>photonics</b>
        z
}

retrieveNumberPublications( makeURLsearch("photonics", c(2008, NULL)) )

I can isolate the long string containing the result I want, but I cannot single out the value itself, i.e. the 21,900 in "<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>".
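To make the goal concrete, here is a rough sketch of the kind of extraction I am after, run on the sample line hard-coded as a string rather than on a live page. It assumes the count always appears as "of about <b>...</b>" and uses a capture group to keep only what lies between the tags. The variable names count.string and count are made up for this illustration, and there may well be a more robust way.

```r
# Sample line copied from the comment in retrieveNumberPublications()
y <- '<b>10</b> of about <b>21,900</b> for <b><b>photonics</b>'

# Match the whole line, but capture only the text between
# "of about <b>" and the next "</b>"; '\\1' keeps that capture
count.string <- sub('.*of about\\s*<b>([^<]*)</b>.*', '\\1', y, perl=TRUE)

# Drop the thousands separator before converting to a number
count <- as.numeric(gsub(',', '', count.string, fixed=TRUE))
count
```

The key difference from my attempt is that the pattern contains a parenthesised group, so that '\\1' actually refers to something.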

Any regexp guru to help me out? I've never got my head around these, other than trivial cases.

Many thanks,

baptiste


On 15 Jan 2009, at 09:45, baptiste auguie wrote:

For the record, I thought I'd share two findings:

First, the Web of Science website does seem to have some sort of API,
as discussed here:

http://scientific.thomson.com/support/faq/webservices/
It does not seem like a trivial thing to set up though.

Second, because I could not pass the search term easily in the
address, I looked into Google Scholar instead, where a typical search
looks like:
http://scholar.google.co.uk/scholar?as_q=plasmonics&num=10&btnG=Search+Scholar&as_epq=&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=1960&as_allsubj=all&hl=en&lr=

Here it is trivial to create such a string with the desired keyword
and dates, and retrieve the number of results using readLines(url) and
grep.


Thanks to Phil Spector for some pointers.

Best wishes,

baptiste

_____________________________

Baptiste Auguié

School of Physics
University of Exeter
Stocker Road,
Exeter, Devon,
EX4 4QL, UK

Phone: +44 1392 264187

http://newton.ex.ac.uk/research/emag

