Re: [R] scraping with session cookies
This may be because the connection to the site via R is taking a long time. I faced this problem too, for the Social-Mention site. I tried a very primitive approach: I put an 'if' condition in the loop,

if (length(output) == 0) {
  output <- getURL(site)
} else {
  # continue with the code
}

It might help you.

Best,
Heramb

On Fri, Sep 21, 2012 at 8:45 PM, CPV <ceal...@gmail.com> wrote:
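The 'if' retry suggested above, combined with tryCatch(), can also answer the question of keeping the loop running after a timeout error. A minimal offline sketch: flaky_fetch() here is a made-up stand-in for getURL(site) that fails twice before succeeding.

```r
# Stand-in for getURL(site): errors on the first two calls, then succeeds.
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Operation timed out")
  "<html>data</html>"
}

# tryCatch() turns an error into an empty result instead of stopping the
# loop, and the if() retries until something comes back (up to a limit).
output <- character(0)
for (attempt in 1:5) {
  output <- tryCatch(flaky_fetch(), error = function(e) character(0))
  if (length(output) > 0) break
}
output   # "<html>data</html>" on the third attempt
```

In a real scraping loop the same pattern applies per URL, so one timed-out page does not abort the rest.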
Re: [R] scraping with session cookies
Thanks for your suggestion. The issue was resolved by Duncan's recommendation. Now I am trying to obtain data from different pages of the same site through a loop; however, getURLContent keeps timing out. The odd part is that I can access the link through a browser with no issues at all! Any ideas why it keeps timing out? Also, how can I keep the loop running after this error? Thanks again for your help!

On Wed, Sep 19, 2012 at 11:36 PM, Heramb Gadgil <heramb.gad...@gmail.com> wrote:
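If the timeouts come from slow responses rather than a refused request, RCurl can raise libcurl's time limits per request. A sketch, with illustrative (not recommended) option values; the useragent option is included because some servers respond differently to non-browser clients, which could explain why a browser succeeds while R times out.

```r
library(RCurl)

# URL from the thread
site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

curl <- getCurlHandle(cookiefile = "")
txt <- getURLContent(site, curl = curl,
                     .opts = list(connecttimeout = 20,   # seconds to establish the connection
                                  timeout = 120,         # seconds for the whole transfer
                                  useragent = "Mozilla/5.0"))
```

These names map directly to libcurl's CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT and CURLOPT_USERAGENT options.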
Re: [R] scraping with session cookies
Hi,

The key is that you want to use the same curl handle for both the postForm() and for getting the data document.

site = u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)
postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle, so we can use that same curl handle to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)

Rather than knowing how to post the form, I like to read the form programmatically and generate an R function to do the submission for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

fun(.curl = curl)

instead of postForm(site, disclaimer_action = "I Agree"). This helps to abstract the details of the form.

D.

On 9/18/12 5:57 PM, CPV wrote:

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] scraping with session cookies
Thank you for your help, Duncan. I have been trying what you suggested; however, I am getting an error when trying to create the function:

fun <- createFunction(forms[[1]])

It says:

Error in isHidden | hasDefault :
  operations are possible only for numeric, logical or complex types

On Wed, Sep 19, 2012 at 12:15 AM, Duncan Temple Lang <dtemplel...@ucdavis.edu> wrote:
Re: [R] scraping with session cookies
You don't need to use getHTMLFormDescription() and createFunction(); instead, you can use the postForm() call. However, getHTMLFormDescription(), etc. is more general, but you need the very latest version of the package to deal with degenerate forms that have no inputs (other than button clicks).

You can get the latest version of the RHTMLForms package from github:

git clone g...@github.com:omegahat/RHTMLForms.git

and that has the fixes for handling the degenerate forms with no arguments.

D.

On 9/19/12 7:51 AM, CPV wrote:
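Since the fixed RHTMLForms lives on github rather than CRAN, a source install along these lines may be easier than the win.binary zip route attempted later in the thread. This sketch assumes the https clone URL for the repository above, that git and R are on the PATH, and (on Windows) that Rtools is installed for building from source.

```shell
# Fetch the repository and install the package from source
git clone https://github.com/omegahat/RHTMLForms.git
R CMD INSTALL RHTMLForms
```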
Re: [R] scraping with session cookies
Thanks again. I ran the script with postForm(site, disclaimer_action = "I Agree") and it does not seem to do anything; the webpage is still the disclaimer page, thus I am getting the error below:

Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "readHTMLTable", for signature "NULL"

I also downloaded the latest version of RHTMLForms (omegahat-RHTMLForms-251743f.zip) and it does not seem to install correctly. I used the code

install.packages("C:/Users/cess/Downloads/omegahat-RHTMLForms-251743f.zip", type = "win.binary", repos = NULL)

Any suggestion of what could be causing these problems?

On Wed, Sep 19, 2012 at 9:49 AM, Duncan Temple Lang <dtemplel...@ucdavis.edu> wrote:
Re: [R] scraping with session cookies
Try this:

library(RCurl)
library(XML)
site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
URL <- getURL(site)
Text = htmlParse(URL, asText = T)

This will give you all the web data in an HTML-text format. You can use the getNodeSet function to extract whatever links or texts you want from that page. I hope this helps.

Best,
Heramb

On Wed, Sep 19, 2012 at 10:26 PM, CPV <ceal...@gmail.com> wrote:
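For the getNodeSet() step suggested above, a minimal offline sketch: the HTML snippet is made up for illustration, where the real document would come from htmlParse(getURL(site), asText = TRUE).

```r
library(XML)

# Made-up stand-in for the parsed page
html <- '<html><body><a href="graph_e.html?stn=05ND012">Station 05ND012</a><p>Water level data</p></body></html>'
doc <- htmlParse(html, asText = TRUE)

# getNodeSet() takes an XPath expression; here, every link on the page
links <- getNodeSet(doc, "//a")
hrefs <- sapply(links, xmlGetAttr, "href")   # "graph_e.html?stn=05ND012"
texts <- sapply(links, xmlValue)             # "Station 05ND012"
```

The same XPath approach ("//table", "//td", etc.) extracts the data cells directly if readHTMLTable() is not flexible enough.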
[R] scraping with session cookies
Hi,

I am starting coding in R and one of the things that I want to do is to scrape some data from the web. The problem that I am having is that I cannot get past the disclaimer page (which produces a session cookie). I have been able to collect some ideas and combine them in the code below, but I don't get past the disclaimer page. I am trying to agree to the disclaimer with the postForm and write the cookie to a file, but I cannot do it successfully. The webpage cookies are written to the file, but the value is FALSE... So, any ideas of what I should do or what I am doing wrong? Thank you for your help,

library(RCurl)
library(XML)

site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
postForm(site, disclaimer_action = "I Agree")

cf <- "cookies.txt"
no_cookie <- function() {
  curlHandle <- getCurlHandle(cookiefile = cf, cookiejar = cf)
  getURL(site, curl = curlHandle)
  rm(curlHandle)
  gc()
}

if (file.exists(cf) == TRUE) {
  file.create(cf)
  no_cookie()
}

allTables <- readHTMLTable(site)
allTables