RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John
To get Nutch 1.2 up and running with Hadoop, should I be using the tutorial on the Nutch wiki (1), this tutorial link (2) or the link to the Hadoop cluster setup (3)? I am using 2 desktops (one running Vista and one running XP) with Cygwin to execute commands and wish to experiment running a

Regarding crawling of short URL's

2011-01-25 Thread Arjun Kumar Reddy
Hi, My application needs to crawl a set of URLs which I put in the urls directory, and fetch the contents of those URLs only. I am not interested in the contents of the internal or external links, so I have run the crawl command with a depth of 1: bin/nutch crawl urls -dir crawl -depth 1

RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John
Hi Tanguy, Using your first suggestion, I am prompted for a password (which I have not set up; after creating the private and public ssh keys using the command ssh-keygen -t dsa, I left the password blank). Regarding your alternative suggestion of using ping, I receive output that 4 packets sent, 4

CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Isabel Drost
This is to announce Berlin Buzzwords 2011, the second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. Call for Presentations Berlin Buzzwords

Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.
Thanks Chris, Charan and Alex. I am looking into the crawl statistics now, and I see fields like db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm. What do they mean? I also see that db_unfetched is much higher than db_fetched. Does it mean most of the pages did not

Re: Few questions from a newbie

2011-01-25 Thread Markus Jelsma
These values come from the CrawlDB and have the following meaning. db_unfetched: This is the number of URLs that are to be crawled when the next batch is started. This number is usually limited by the generate.max.per.host setting. So, if there are 5000 unfetched and generate.max.per.host is
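The effect of a per-host cap on the next batch can be illustrated with a small sketch. This is a simplified model, not Nutch's generator: the function name select_batch, the example hosts and the cap of 1000 are all hypothetical, and other limits that may apply to a real crawl are ignored.

```python
from collections import Counter
from urllib.parse import urlparse


def select_batch(unfetched_urls, max_per_host):
    """Pick at most max_per_host URLs per host.

    In this sketch a negative cap is treated as "no limit".
    """
    taken = Counter()
    batch = []
    for url in unfetched_urls:
        host = urlparse(url).netloc
        if max_per_host < 0 or taken[host] < max_per_host:
            taken[host] += 1
            batch.append(url)
    return batch


# Hypothetical data: 5000 unfetched URLs on one host, 10 on another.
unfetched = [f"http://example.com/page{i}" for i in range(5000)]
unfetched += [f"http://example.org/page{i}" for i in range(10)]

batch = select_batch(unfetched, max_per_host=1000)  # 1000 is an assumed value
print(len(batch))                   # URLs selected for the next batch
print(len(unfetched) - len(batch))  # URLs left unfetched after that batch
```

Under these assumptions, most of the URLs on the large host remain unfetched after one batch, which is one way a high db_unfetched count can arise.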

Re: Regarding crawling of short URL's

2011-01-25 Thread Markus Jelsma
Reading a URL from the DB returns the HTTP response of that URL, some header information and body. Crawling a URL that answers with an HTTP redirect won't store the HTTP response of the redirection target under that redirecting URL. Hi, My application needs to crawl a set of urls which I give to the
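The distinction between the record stored for a short URL and the content of its redirect target can be sketched with a toy lookup table. This is not how Nutch stores data; the STORE dictionary, the URLs and the helper names read_record and resolve are hypothetical and only illustrate the point made above.

```python
# Toy stand-in for stored fetch results: URL -> (HTTP status, Location, body).
# All URLs and contents here are hypothetical.
STORE = {
    "http://sho.rt/abc": (301, "http://example.com/article", ""),
    "http://example.com/article": (200, None, "<html>article text</html>"),
}

REDIRECT_CODES = (301, 302, 303, 307, 308)


def read_record(url):
    """Return what was stored for exactly this URL, without following redirects."""
    return STORE[url]


def resolve(url, max_hops=5):
    """Follow Location headers until a non-redirect response is reached."""
    for _ in range(max_hops):
        status, location, body = STORE[url]
        if status not in REDIRECT_CODES or location is None:
            return url, body
        url = location
    raise RuntimeError("too many redirects")


print(read_record("http://sho.rt/abc"))  # the redirect response itself
print(resolve("http://sho.rt/abc"))      # the target URL and its body
```

In this model, getting the page content for a short URL means looking up the target URL as a separate record, which a crawl that stops at the seed URLs would not have fetched.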

Archiving Audio and Video

2011-01-25 Thread Adam Estrada
Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would require some kind of transcription which may be out of reach on this project. Thanks,

Re: Archiving Audio and Video

2011-01-25 Thread Gora Mohanty
On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would