Re: [R] scraping with session cookies
This may be because the connection to the site via R is taking a long time. I faced this problem too, for the Social-Mention site. I tried a very primitive approach: I put an 'if' condition in the loop,

if (length(output) == 0) {
  output <- getURL(site)
} else {
  # continue with the code
}

It might help you.

Best,
Heramb

On Fri, Sep 21, 2012 at 8:45 PM, CPV <ceal...@gmail.com> wrote:
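The 'if' retry suggested above, combined with tryCatch(), can also answer the question of keeping the loop running after a timeout error. A minimal offline sketch: flaky_fetch() here is a made-up stand-in for getURL(site) that fails twice before succeeding.

```r
# Stand-in for getURL(site): errors on the first two calls, then succeeds.
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("Operation timed out")
  "<html>data</html>"
}

# tryCatch() turns an error into an empty result instead of stopping the
# loop, and the if() retries until something comes back (up to a limit).
output <- character(0)
for (attempt in 1:5) {
  output <- tryCatch(flaky_fetch(), error = function(e) character(0))
  if (length(output) > 0) break
}
output   # "<html>data</html>" on the third attempt
```

In a real scraping loop the same pattern applies per URL, so one timed-out page does not abort the rest.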
Re: [R] scraping with session cookies
Thanks for your suggestion. The issue was resolved by Duncan's recommendation. Now I am trying to obtain data from different pages of the same site through a loop; however, getURLContent keeps timing out. The odd part is that I can access the link through a browser with no issues at all! Any ideas why it keeps timing out? Also, how can I keep the loop running after this error? Thanks again for your help!

On Wed, Sep 19, 2012 at 11:36 PM, Heramb Gadgil <heramb.gad...@gmail.com> wrote:
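If the timeouts come from slow responses rather than a refused request, RCurl can raise libcurl's time limits per request. A sketch, with illustrative (not recommended) option values; the useragent option is included because some servers respond differently to non-browser clients, which could explain why a browser succeeds while R times out.

```r
library(RCurl)

# URL from the thread
site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

curl <- getCurlHandle(cookiefile = "")
txt <- getURLContent(site, curl = curl,
                     .opts = list(connecttimeout = 20,   # seconds to establish the connection
                                  timeout = 120,         # seconds for the whole transfer
                                  useragent = "Mozilla/5.0"))
```

These names map directly to libcurl's CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT and CURLOPT_USERAGENT options.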
Re: [R] scraping with session cookies
Hi,

The key is that you want to use the same curl handle for both the postForm() and for getting the data document.

site = u = "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"

library(RCurl)
curl = getCurlHandle(cookiefile = "", verbose = TRUE)
postForm(site, disclaimer_action = "I Agree", curl = curl)

Now we have the cookie in the curl handle, so we can use that same curl handle to request the data document:

txt = getURLContent(u, curl = curl)

Now we can use readHTMLTable() on the local document content:

library(XML)
tt = readHTMLTable(txt, asText = TRUE, which = 1, stringsAsFactors = FALSE)

Rather than knowing how to post the form, I like to read the form programmatically and generate an R function to do the submission for me. The RHTMLForms package can do this.

library(RHTMLForms)
forms = getHTMLFormDescription(u, FALSE)
fun = createFunction(forms[[1]])

Then we can use

fun(.curl = curl)

instead of postForm(site, disclaimer_action = "I Agree"). This helps to abstract the details of the form.

D.

On 9/18/12 5:57 PM, CPV wrote:

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] scraping with session cookies
Thank you for your help, Duncan. I have been trying what you suggested; however, I am getting an error when trying to create the function:

fun <- createFunction(forms[[1]])

It says:

Error in isHidden | hasDefault :
  operations are possible only for numeric, logical or complex types

On Wed, Sep 19, 2012 at 12:15 AM, Duncan Temple Lang <dtemplel...@ucdavis.edu> wrote:
Re: [R] scraping with session cookies
You don't need to use getHTMLFormDescription() and createFunction(); instead, you can use the postForm() call. However, getHTMLFormDescription(), etc. is more general, but you need the very latest version of the package to deal with degenerate forms that have no inputs (other than button clicks).

You can get the latest version of the RHTMLForms package from github:

git clone g...@github.com:omegahat/RHTMLForms.git

and that has the fixes for handling the degenerate forms with no arguments.

D.

On 9/19/12 7:51 AM, CPV wrote:
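Since the fixed RHTMLForms lives on github rather than CRAN, a source install along these lines may be easier than the win.binary zip route attempted later in the thread. This sketch assumes the https clone URL for the repository above, that git and R are on the PATH, and (on Windows) that Rtools is installed for building from source.

```shell
# Fetch the repository and install the package from source
git clone https://github.com/omegahat/RHTMLForms.git
R CMD INSTALL RHTMLForms
```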
Re: [R] scraping with session cookies
Thanks again. I ran the script with postForm(site, disclaimer_action = "I Agree") and it does not seem to do anything; the webpage is still the disclaimer page, thus I am getting the error below:

Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "readHTMLTable", for signature "NULL"

I also downloaded the latest version of RHTMLForms (omegahat-RHTMLForms-251743f.zip) and it does not seem to install correctly. I used the code

install.packages("C:/Users/cess/Downloads/omegahat-RHTMLForms-251743f.zip", type = "win.binary", repos = NULL)

Any suggestion of what could be causing these problems?

On Wed, Sep 19, 2012 at 9:49 AM, Duncan Temple Lang <dtemplel...@ucdavis.edu> wrote:
Re: [R] scraping with session cookies
Try this:

library(RCurl)
library(XML)
site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
URL <- getURL(site)
Text = htmlParse(URL, asText = T)

This will give you all the web data in an HTML-text format. You can use the getNodeSet function to extract whatever links or texts you want from that page. I hope this helps.

Best,
Heramb

On Wed, Sep 19, 2012 at 10:26 PM, CPV <ceal...@gmail.com> wrote:
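For the getNodeSet() step suggested above, a minimal offline sketch: the HTML snippet is made up for illustration, where the real document would come from htmlParse(getURL(site), asText = TRUE).

```r
library(XML)

# Made-up stand-in for the parsed page
html <- '<html><body><a href="graph_e.html?stn=05ND012">Station 05ND012</a><p>Water level data</p></body></html>'
doc <- htmlParse(html, asText = TRUE)

# getNodeSet() takes an XPath expression; here, every link on the page
links <- getNodeSet(doc, "//a")
hrefs <- sapply(links, xmlGetAttr, "href")   # "graph_e.html?stn=05ND012"
texts <- sapply(links, xmlValue)             # "Station 05ND012"
```

The same XPath approach ("//table", "//td", etc.) extracts the data cells directly if readHTMLTable() is not flexible enough.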
[R] scraping with session cookies
Hi,

I am starting coding in R and one of the things that I want to do is to scrape some data from the web. The problem that I am having is that I cannot get past the disclaimer page (which produces a session cookie). I have been able to collect some ideas and combine them in the code below, but I don't get past the disclaimer page. I am trying to agree to the disclaimer with the postForm and write the cookie to a file, but I cannot do it successfully. The webpage cookies are written to the file, but the value is FALSE... So, any ideas of what I should do or what I am doing wrong? Thank you for your help,

library(RCurl)
library(XML)

site <- "http://www.wateroffice.ec.gc.ca/graph/graph_e.html?mode=text&stn=05ND012&prm1=3&syr=2012&smo=09&sday=15&eyr=2012&emo=09&eday=18"
postForm(site, disclaimer_action = "I Agree")

cf <- "cookies.txt"
no_cookie <- function() {
  curlHandle <- getCurlHandle(cookiefile = cf, cookiejar = cf)
  getURL(site, curl = curlHandle)
  rm(curlHandle)
  gc()
}

if (file.exists(cf) == TRUE) {
  file.create(cf)
  no_cookie()
}

allTables <- readHTMLTable(site)
allTables