Hi!
Thanks for the code examples. I'll try to elaborate a bit here.
If you paste this:
http://scholar.google.fi/scholar?hl=fi&oe=ASCII&q=Frank+Harrell
to your browser, you'll get the citations, and each citation lists a link
("Import to EndNote") to export a citation in EndNote format.
If you save this page as an HTML file, you'll find the links (they contain
the string "info:") pointing to these EndNote files. For example, the link
labelled:
Import into EndNote
Now, if you do this in R:
library(RCurl)
curl = getCurlHandle()
z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell', hl = 'en',
btnG = 'Search', oe = "ASCII", .opts = list(verbose = TRUE), curl = curl)
the object z does not contain any links with "info:" in them:
grep("info:", z)
integer(0)
Fortunately there is a "related:"-link that gives us the same ID
(U6Gfb4QPVFMJ) as the "info:"-link above:
substr(z, gregexpr("related:", z)[[1]]+8, gregexpr("related:", z)[[1]]+19)
[1] "U6Gfb4QPVFMJ"
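The fixed character offsets above only work if the ID is always 12 characters long. A minimal sketch of a pattern-based alternative follows. It assumes that the ID runs from "related:" up to the next colon, and z_example is a made-up fragment standing in for the real page in z:

```r
# Hypothetical fragment standing in for the search result page held in z
z_example <- '<a href="/scholar?q=related:U6Gfb4QPVFMJ:scholar.google.com/&hl=en">Related articles</a>'

# Match "related:" followed by everything up to (not including) the next colon
m <- regmatches(z_example, gregexpr("related:[^:]+", z_example))[[1]]

# Drop the "related:" prefix and remove duplicate IDs
ids <- unique(sub("^related:", "", m))
ids
```

Because gregexpr() finds all matches, the same lines should pick up every ID on the page, not just the first one.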
Now checking from the Google Scholar page, the correct format for the
EndNote query would appear to be:
http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0
You can copy and paste this link into your browser and save the EndNote
reference as a file.
Yet, when this link is constructed in R:
getURL(paste("http://scholar.google.fi/scholar.enw?q=info:",
substr(z, gregexpr("related:", z)[[1]]+8, gregexpr("related:", z)[[1]]+19),
":scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0",
sep=""), curl=curl)
the result is an HTML file containing a "403 Forbidden" error.
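To rule out a mistake in the pasted string, the URL can be built separately from the request and compared with the link that works in the browser. This is only a sketch: endnote_url is a hypothetical helper name, and the host and query parameters are copied from the link above.

```r
# Hypothetical helper: builds the EndNote export URL for one citation ID.
# The query string is copied verbatim from the link that works in the browser.
endnote_url <- function(id, host = "scholar.google.fi") {
  paste("http://", host, "/scholar.enw?q=info:", id,
        ":scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0",
        sep = "")
}

u <- endnote_url("U6Gfb4QPVFMJ")

# Link copied from the browser, for comparison
browser_url <- "http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=en&oe=ASCII&oe=ASCII&ct=citation&cd=0"
same <- identical(u, browser_url)
same
```

If the two strings are identical, the string construction is not the cause, and the difference must lie in how the request itself is sent.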
Unfortunately, this type of functionality also seems to be missing from the
Google API (thanks to Peter Konings for the link):
http://code.google.com/p/google-ajax-apis/issues/detail?id=109
- Jarno
2009/9/18 Duncan Temple Lang
>
> Hi Jarno
>
> You've only told us half the story. You didn't show how you
> i) performed the original query
> ii) retrieved the URL you used in subsequent queries
>
>
> But I can suggest a few possible problems.
>
> a) specifying the cookiejar option tells libcurl where to write the
> cookies that the particular curl handle has collected during its life.
> These are written when the curl handle is destroyed.
> So that wouldn't change the getURL() operation, just change what happens
> when the curl handle is destroyed.
>
> b) You probably mean to use cookiefile rather than cookiejar so that
> the curl request would read existing cookies from a file.
> But in that case, how did that file get created with the correct cookies?
>
> c) libcurl will collect cookies in a curl handle as it receives them from a
> server as part of a response. And it will use these in subsequent requests to
> that server.
> But you must be using the same curl handle. Different curl handles are
> entirely independent (unless one is copied from another).
> So a possible solution may be that you need to do the initial query with
> the same curl handle.
>
>
> So I would try something like
>
> curl = getCurlHandle()
> z = getForm("http://scholar.google.com/scholar", q = 'Frank Harrell', hl = 'en',
> btnG = 'Search', .opts = list(verbose = TRUE), curl = curl)
>
> dd = htmlParse(z)
> links = getNodeSet(dd, "//a[@href]")
>
> # do something to identify the link you want
>
> tmp = getURL(linkIWant, curl = curl)
>
>
> Note that we are using the same curl object in both requests.
>
>
> This may not do what you want, but if you let us know the details
> about how you are doing the preceding steps, we should be able to sort
> things out.
>
> D.
>
>
> Jarno Tuimala wrote:
> > Hi!
> >
> > I've performed a Google Scholar Search using a query, let's say "Frank
> > Harrell", and parsed the links to the EndNote references from the resulting
> > HTML code. Now I'd like to download all the references automatically. For
> > this, I have tried to use RCurl, but I can't seem to get it working: I
> > always get error code "403 Forbidden" from the web server.
> >
> > Initially I tried to do this without using cookies:
> >
> > library(RCurl)
> > getURL("http://scholar.google.fi/scholar.enw?q=info:U6Gfb4QPVFMJ:scholar.google.com/&output=citation&hl=fi&oe=ASCII&ct=citation&cd=0")
> >
> > or
> >
> > getURLContent("
>