Re: [R] Developing a web crawler

2011-03-29 Thread antujsrv
Hi Stefan,

Thanks for the links you shared in the post, but I am unable to access the
scripts and output: the archive requires a password.
If you could let me know the password for the .rar file of the scripts_other
5, it would be really helpful.
Thanks in advance.


--
View this message in context: 
http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3414627.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Developing a web crawler

2011-03-14 Thread Evanescence
Can I ask a question? Do I need to be good at math to develop a web crawler?
(I want to develop a simple web crawler to do something.)

--
View this message in context: 
http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3353291.html
Sent from the R help mailing list archive at Nabble.com.



[R] Developing a web crawler

2011-03-03 Thread antujsrv
Hi,

I wish to develop a web crawler in R. I have been using the functionalities
available under the RCurl package.
I am able to extract the HTML content of the site, but I don't know how to go
about analyzing the HTML-formatted document.
I wish to know the frequency of a word in the document. I am only acquainted
with analyzing data sets.
So how should I go about analyzing data that is not available in table
format?

A few chunks of code that I wrote:
w <- getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
write.table(w, "test.txt")
t <- readLines(w)

readLines did not prove to be of any help either.

Any help would be highly appreciated. Thanks in advance.
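
A note on the readLines() call in this post: readLines() expects a file path or a connection, whereas getURL() returns the page as a single character string, so passing that string in makes R try to open a file with that name. Below is a minimal sketch of two base-R ways to split such a string into lines. The variable page_html and its contents are made up for illustration; it is not output from the Amazon URL.

```r
# Stand-in for the string returned by getURL(); the content is made up.
page_html <- "<html>\n<body>\n<p>Kindle review text</p>\n</body>\n</html>"

# Option 1: wrap the string in a text connection so readLines() can read it.
con <- textConnection(page_html)
lines_a <- readLines(con)
close(con)

# Option 2: split the string on newline characters directly.
lines_b <- strsplit(page_html, "\n", fixed = TRUE)[[1]]

print(lines_a)
print(identical(lines_a, lines_b))
```

Either result is an ordinary character vector, which the usual string functions can then process line by line.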


--
View this message in context: 
http://r.789695.n4.nabble.com/Developing-a-web-crawler-tp3332993p3332993.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Developing a web crawler

2011-03-03 Thread rex.dwyer
Perl seems like a 10x better choice for the task, but try looking at the 
examples in ?strsplit to get started.
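
Following the ?strsplit pointer, here is a rough base-R sketch of counting word frequencies in an HTML string. The sample string, the variable names, and the regex-based tag stripping are assumptions for illustration only. Regular expressions are a crude way to remove markup, and a real page would likely be better served by a proper HTML parser.

```r
# Made-up HTML standing in for the result of getURL().
page_html <- "<html><body><h1>Kindle</h1><p>The Kindle is light. The screen is sharp.</p></body></html>"

# Crudely drop anything that looks like a tag, replacing it with a space.
text_only <- gsub("<[^>]*>", " ", page_html)

# Lower-case the text, then split on runs of non-letter characters.
words <- strsplit(tolower(text_only), "[^a-z]+")[[1]]
words <- words[nzchar(words)]  # drop empty strings left at the edges

# Tabulate the words and sort by descending frequency.
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# Frequency of one particular word.
print(word_freq["kindle"])
```

The split pattern keeps only the letters a to z, so digits and accented characters are treated as separators. That is a simplification which would need adjusting for other text.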







Re: [R] Developing a web crawler

2011-03-03 Thread Alexy Khrabrov

On Mar 3, 2011, at 4:22 AM, antujsrv wrote:
>
> I wish to develop a web crawler in R.

As Rex said, there are faster languages, but R string processing got better due 
to the stringr package (R Journal 2010-2).  When Hadley is done with it, it 
will be like having it all in R!

-- Alexy


Re: [R] Developing a web crawler / R webkit or something similar?

2011-03-03 Thread Mike Marchywka

> Date: Thu, 3 Mar 2011 01:22:44 -0800
> From: antuj...@gmail.com
> To: r-help@r-project.org
> Subject: [R] Developing a web crawler
>
> Hi,
>
> I wish to develop a web crawler in R. I have been using the functionalities
> available under the RCurl package.
> I am able to extract the html content of the site but i don't know how to go

In general this can be a big effort but there may be things in 
text processing packages you could adapt to execute html and javascript.
However, I guess what I'd be looking for is something like a webkit
package or other open source browser with or without an R interface.
This actually may be an ideal solution for a lot of things as you get
all the content handlers of at least some browser. 


Now that you mention it, I wonder if there are browser plugins to handle
R content ( I'd have to give this some thought, put a script up as
a web page with mime type test/R and have it execute it in R. )



> about analyzing the html formatted document.
> I wish to know the frequency of a word in the document. I am only acquainted
> with analyzing data sets.
> So how should i go about analyzing data that is not available in table
> format.
>
> Few chunks of code that i wrote:
> w <-
> getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
> write.table(w, "test.txt")
> t <- readLines(w)
>
> readLines also didnt prove out to be of any help.
>
> Any help would be highly appreciated. Thanks in advance.




Re: [R] Developing a web crawler

2011-03-03 Thread Stefan Th. Gries
Hi

The book whose companion website is here
http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/qclwr.html
deals with many of the things you need for a web crawler, and
assignment "other 5" on that site
(http://www.linguistics.ucsb.edu/faculty/stgries/research/qclwr/other_5.pdf)
is a web crawler.
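
For the crawling part itself, the core step is pulling links out of a fetched page so they can be visited next. The following base-R sketch is not taken from the book or the assignment. The sample HTML, the variable names, and the simple href pattern are illustrative assumptions, and a regex like this will miss unusual markup that a proper HTML parser would handle.

```r
# Made-up page content standing in for a fetched HTML document.
page_html <- paste0(
  "<a href=\"http://example.org/a.html\">A</a> ",
  "<a href=\"http://example.org/b.html\">B</a> ",
  "<a href=\"http://example.org/a.html\">A again</a>"
)

# Find every href="..." attribute (double-quoted values only).
matches <- regmatches(page_html, gregexpr("href=\"[^\"]*\"", page_html))[[1]]

# Strip the href=" prefix and the closing quote to leave bare URLs.
links <- sub("^href=\"", "", sub("\"$", "", matches))

# A crawler keeps a list of pages still to visit and skips repeats.
to_visit <- unique(links)
print(to_visit)
```

In a fuller crawler, each URL in to_visit would be fetched in turn and its own links appended to the list, with already-visited pages filtered out.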

Best,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries



Re: [R] Developing a web crawler / R webkit or something similar? [off topic]

2011-03-03 Thread Matt Shotwell

On 03/03/2011 08:07 AM, Mike Marchywka wrote:
>> Date: Thu, 3 Mar 2011 01:22:44 -0800
>> From: antuj...@gmail.com
>> To: r-help@r-project.org
>> Subject: [R] Developing a web crawler
>>
>> Hi,
>>
>> I wish to develop a web crawler in R. I have been using the functionalities
>> available under the RCurl package.
>> I am able to extract the html content of the site but i don't know how to go
>
> In general this can be a big effort but there may be things in
> text processing packages you could adapt to execute html and javascript.
> However, I guess what I'd be looking for is something like a webkit
> package or other open source browser with or without an R interface.
> This actually may be an ideal solution for a lot of things as you get
> all the content handlers of at least some browser.
>
> Now that you mention it, I wonder if there are browser plugins to handle
> R content ( I'd have to give this some thought, put a script up as
> a web page with mime type test/R and have it execute it in R. )

There are server-side solutions for this sort of thing. See 
http://rapache.net/ . Also, there was a string of messages on R-devel 
some years ago addressing the mime type issue; beginning here: 
http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html . Though I don't 
know whether there was a resolution. Some suggestions were text/x-R, 
text/x-Rd, application/x-RData.


-Matt






>> about analyzing the html formatted document.
>> I wish to know the frequency of a word in the document. I am only acquainted
>> with analyzing data sets.
>> So how should i go about analyzing data that is not available in table
>> format.
>>
>> Few chunks of code that i wrote:
>> w <-
>> getURL("http://www.amazon.com/Kindle-Wireless-Reader-Wifi-Graphite/dp/B003DZ1Y8Q/ref=dp_reviewsanchor#FullQuotes")
>> write.table(w, "test.txt")
>> t <- readLines(w)
>>
>> readLines also didnt prove out to be of any help.
>>
>> Any help would be highly appreciated. Thanks in advance.






--
Matthew S Shotwell   Assistant Professor   School of Medicine
 Department of Biostatistics   Vanderbilt University
