date:20110125

RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John

In getting Nutch 1.2 up and running with Hadoop, should I be using the tutorial on the Nutch wiki (1), this tutorial link (2) or the link to the Hadoop cluster setup (3)? I am using 2 desktops (one running Vista and one running XP) with Cygwin to execute commands and wish to experiment running a

Regarding crawling of short URL's

2011-01-25 Thread Arjun Kumar Reddy

Hi, my application needs to crawl a set of URLs which I place in the urls directory, and fetch the contents of those URLs only. I am not interested in the contents of the internal or external links, so I have run the crawl command with a depth of 1: bin/nutch crawl urls -dir crawl -depth 1
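The setup described in this message can be sketched as follows. This is a minimal illustration, not the poster's actual script: it writes a seed list into a urls directory (one URL per line) and builds the argument list for the crawl command quoted above. The file name seed.txt, the helper names write_seed_list and crawl_command, and the example.com URLs are assumptions made for illustration; the sketch only constructs the command and does not run Nutch.

```python
import tempfile
from pathlib import Path


def write_seed_list(urls_dir, urls):
    """Write one URL per line into a seed file inside urls_dir."""
    urls_dir = Path(urls_dir)
    urls_dir.mkdir(parents=True, exist_ok=True)
    # "seed.txt" is a hypothetical name chosen for this sketch.
    seed_file = urls_dir / "seed.txt"
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


def crawl_command(urls_dir="urls", crawl_dir="crawl", depth=1):
    """Build the argument list for the crawl command quoted in the message."""
    return [
        "bin/nutch", "crawl", str(urls_dir),
        "-dir", str(crawl_dir),
        "-depth", str(depth),
    ]


if __name__ == "__main__":
    # Use a temporary directory so the sketch leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        seed = write_seed_list(
            Path(tmp) / "urls",
            ["http://example.com/a", "http://example.com/b"],  # placeholder URLs
        )
        print(seed.read_text(encoding="utf-8"), end="")
    print(" ".join(crawl_command()))
```

Returning the command as a list rather than a single string keeps it ready to pass to a process launcher without shell quoting concerns.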

Regarding crawling of short URL's

2011-01-25 Thread Arjun Kumar Reddy

Hi, my application needs to crawl a set of URLs which I place in the urls directory, and fetch the contents of those URLs only. I am not interested in the contents of the internal or external links, so I have run the crawl command with a depth of 1: bin/nutch crawl urls -dir crawl -depth 1

RE: Hadoop Tutorial

2011-01-25 Thread McGibbney, Lewis John

Hi Tanguy, using your first suggestion I am prompted for a password (which I have not set up; after creating the private and public ssh keys using the command ssh-keygen -t dsa, I left the password blank). Regarding your alternative suggestion of using ping, I receive output that 4 packets sent, 4

CFP - Berlin Buzzwords 2011 - Search, Score, Scale

2011-01-25 Thread Isabel Drost

This is to announce the Berlin Buzzwords 2011. The second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. Call for Presentations Berlin Buzzwords

Re: Few questions from a newbie

2011-01-25 Thread .: Abhishek :.

Thanks Chris, Charan and Alex. I am looking into the crawl statistics now, and I see fields like db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm. What do they mean? I also see that db_unfetched is far higher than db_fetched. Does it mean most of the pages did not

Re: Few questions from a newbie

2011-01-25 Thread Markus Jelsma

These values come from the CrawlDB and have the following meaning. db_unfetched: this is the number of URLs that are to be crawled when the next batch is started. This number is usually limited by the generate.max.per.host setting. So, if there are 5000 unfetched and generate.max.per.host is

Re: Regarding crawling of short URL's

2011-01-25 Thread Markus Jelsma

Reading a URL from the DB returns the HTTP response of that URL, some header information and the body. Crawling a URL with an HTTP redirect won't result in the HTTP response of the redirection target for that redirecting URL. Hi, My application needs to crawl a set of urls which I give to the

Archiving Audio and Video

2011-01-25 Thread Adam Estrada

Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would require some kind of transcription which may be out of reach on this project. Thanks,

Re: Archiving Audio and Video

2011-01-25 Thread Gora Mohanty

On Wed, Jan 26, 2011 at 9:15 AM, Adam Estrada estrada.adam.gro...@gmail.com wrote: Curious...I have been using Nutch for a while now and have never tried to index any audio or video formats. Is it feasible to grab the audio out of both forms of media and then index it? I believe this would


10 matches
