Some Web servers are strict. In this case, the server won't accept
a request without being told who is asking, i.e. without a User-Agent header.

If you use

 getURL("http://www.youtube.com";,
          httpheader = c("User-Agent" = "R (2.9.0)")))

you should get the contents of the page as expected.


(Or with URL uk.youtube.com, etc.)
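Equivalently, libcurl has a dedicated user-agent option, which RCurl
should expose as 'useragent' (check listCurlOptions() to confirm it is
there); assuming that mapping, this sketch does the same thing:

 # hedged sketch: set the User-Agent via the dedicated curl option
 # rather than a raw header (assumes 'useragent' is in listCurlOptions())
 library(RCurl)
 getURL("http://www.youtube.com", useragent = "R (2.9.0)")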


 D.


clair.crossup...@googlemail.com wrote:
Thank you. The output I get from that example is below:

d = debugGatherer()
getURL("http://uk.youtube.com";,
+          debugfunction = d$update, verbose = TRUE )
[1] ""
d$value()
text
"About to connect() to uk.youtube.com port 80 (#0)\n  Trying
208.117.236.72... connected\nConnected to uk.youtube.com
(208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
left intact\n"
headerIn
"HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-
Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r
\nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009
15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-
Content-Type-Options: nosniff\r\nCache-Control: no-cache\r
\nCneonction: close\r\n\r\n"
headerOut
"GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
dataIn
"0\r\n\r\n"
dataOut
""

So the critical information from this is the '400 Bad Request'. A
Google search defines this for me as:

    The request could not be understood by the server due to malformed
    syntax. The client SHOULD NOT repeat the request without
    modifications.


Looking through both sort(listCurlOptions()) and
http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
help me this time (unless I missed something). Any advice?

Thank you for your time,
C.C

P.S. I can get the download to work if I use:
toString(readLines("http://www.uk.youtube.com"))
[1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
\t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
\tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
new Array(16), \t\t\t\tbannersizes[0] = [etc]




On 27 Jan, 13:52, Duncan Temple Lang <dun...@wald.ucdavis.edu> wrote:
clair.crossup...@googlemail.com wrote:
Thank you Duncan.
I remember seeing in your documentation that you have used this
'verbose=TRUE' argument in functions before when trying to see what is
going on. This is good. However, I have not been able to get it to
work for me. Does the output appear in R, or do you use some other
external window (e.g. an MS-DOS window)?
The libcurl code typically defaults to printing on the console.
So on the Windows GUI, this will not show up. Using
a shell (an MS-DOS window or a Unix-like shell) should
cause the output to be displayed.

A more general way however is to use the debugfunction
option.

d = debugGatherer()

getURL("http://uk.youtube.com";,
         debugfunction = d$update, verbose = TRUE)

When this completes, use

  d$value()

and you have the entire contents that would be displayed on the console.
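Since d$value() returns a named character vector (as the transcript
above shows), you can also pull out individual pieces; a small sketch,
assuming the gatherer's reset() function as in the RCurl documentation:

  info <- d$value()
  names(info)            # "text" "headerIn" "headerOut" "dataIn" "dataOut"
  cat(info["headerIn"])  # just the response headers, e.g. to spot a 301 or 400
  d$reset()              # clear the gathered text before the next request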

  D.



library(RCurl)
my.url <- 
'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...
getURL(my.url, verbose = TRUE)
[1] ""
I am having a problem with a new webpage (http://uk.youtube.com/) but
if i can get this verbose to work, then i think i will be able to
google the right action to take based on the information it gives.
Many thanks for your time,
C.C.
On 26 Jan, 16:12, Duncan Temple Lang <dun...@wald.ucdavis.edu> wrote:
clair.crossup...@googlemail.com wrote:
Dear R-help,
There seems to be a web page I am unable to download using RCurl. I
don't understand why it won't download:
library(RCurl)
my.url <- 
"http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...";
getURL(my.url)
[1] ""
  I like the irony that RCurl seems to have difficulties downloading an
article about R.  Good thing it is just a matter of additional arguments
to getURL() or it would be bad news.
The followlocation parameter defaults to FALSE, so
   getURL(my.url, followlocation = TRUE)
gets what you want.
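
As an aside, when following redirects it can be worth capping how many
hops libcurl will take; assuming RCurl exposes CURLOPT_MAXREDIRS as
'maxredirs' (again, listCurlOptions() will confirm), a sketch:

   # hedged sketch: follow the 301 but give up after 10 redirects
   getURL(my.url, followlocation = TRUE, maxredirs = 10)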
The way I found this is
   getURL(my.url, verbose = TRUE)
and then taking a look at the information being sent from R
and received by R from the server.
This gives
* About to connect() to www.nytimes.com port 80 (#0)
*   Trying 199.239.136.200... * connected
* Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
Host: www.nytimes.com
Accept: */*
< HTTP/1.1 301 Moved Permanently
< Server: Sun-ONE-Web-Server/6.1
< Date: Mon, 26 Jan 2009 16:10:51 GMT
< Content-length: 0
< Content-type: text/html
< Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
<
And the 301 is the critical thing here: the server is redirecting, and
since followlocation defaults to FALSE, RCurl does not follow the
redirect, so you get back an empty string.
  D.
Other web pages are OK to download, but this is the first time I have
been unable to download a web page using the very nice RCurl package.
While I can download the webpage using the RDCOMClient, I would like
to understand why it doesn't work as above, please.
library(RDCOMClient)
my.url <- 
"http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...";
ie <- COMCreate("InternetExplorer.Application")
txt <- list()
ie$Navigate(my.url)
NULL
while(ie[["Busy"]]) Sys.sleep(1)
txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
txt
$`http://www.nytimes.com/2009/01/07/technology/business-computing/
07program.html?_r=2`
[1] "Skip to article Try Electronic Edition Log ...
Many thanks for your time,
C.C
Windows Vista, running with administrator privileges.
sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.
1252;LC_MONETARY=English_United Kingdom.
1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
base
other attached packages:
[1] RDCOMClient_0.92-0 RCurl_0.94-0
loaded via a namespace (and not attached):
[1] tools_2.8.1
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
