In getting Nutch 1.2 up and running with Hadoop, should I be using the tutorial
on the Nutch wiki (1), this tutorial link (2), or the link to the Hadoop cluster
setup (3)?
I am using two desktops (one running Vista and one running XP) with Cygwin to
execute commands, and wish to experiment with running a
Hi,
My application needs to crawl a set of URLs which I give in the urls
directory, and fetch only the contents of those URLs.
I am not interested in the contents of the internal or external links.
So I have run the crawl command with depth 1:
bin/nutch crawl urls -dir crawl -depth 1
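With depth 1 no outlinks are followed into a second round, which already gives the behaviour described. As a belt-and-braces measure, a sketch of a `nutch-site.xml` override (assuming Nutch's standard `db.ignore.*` properties from `nutch-default.xml`) that also keeps discovered outlinks out of the CrawlDB:

```xml
<!-- nutch-site.xml fragment: sketch, assuming the standard db.ignore.* properties -->
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>Do not record outlinks to the same host.</description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>Do not record outlinks to other hosts.</description>
</property>
```

With both set to true, the updatedb step adds no new URLs, so later rounds (if any) would refetch only the seeds.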
Hi Tanguy,
Using your first suggestion, I am prompted for a password (which I have not set
up; after creating the private and public SSH keys with the command ssh-keygen -t
dsa, I left the passphrase blank). Regarding your alternative suggestion of using
ping, I receive output that 4 packets sent, 4
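For reference, passwordless SSH between nodes is usually set up by generating a key with an empty passphrase and appending the public key to authorized_keys on the target machine. A sketch (the thread uses -t dsa per the Hadoop tutorial; rsa is shown here because some newer OpenSSH builds no longer generate DSA keys):

```shell
# create ~/.ssh if it does not exist yet, with the permissions sshd requires
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# generate a key pair with an empty passphrase (-P ''), unless one already exists
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize the key for logins to this machine;
# for a remote node, copy id_rsa.pub over and append it there instead
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

If you are still prompted for a password after this, the usual culprits are permissions on ~/.ssh or authorized_keys being too open on the target machine.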
This is to announce Berlin Buzzwords 2011, the second edition of the
successful conference on scalable and open search, data processing, and data
storage in Germany, taking place in Berlin.
Call for Presentations Berlin Buzzwords
Thanks Chris, Charan and Alex.
I am looking into the crawl statistics now. And I see fields like
db_unfetched, db_fetched, db_gone, db_redir_temp and db_redir_perm; what do
they mean?
Also, I see that db_unfetched is much higher than db_fetched. Does
it mean most of the pages did not
These values come from the CrawlDB and have the following meanings.
db_unfetched
This is the number of URLs that are to be crawled when the next batch is
started. This number is usually limited by the generate.max.per.host
setting. So, if there are 5000 unfetched and generate.max.per.host is
Reading a URL from the DB returns the HTTP response of that URL, some header
information, and the body. Crawling a URL with an HTTP redirect won't result in
the HTTP response of the redirection target for that redirecting URL.
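These per-status counters can be printed directly from the CrawlDB with Nutch's readdb tool (assuming the crawl directory from the crawl command earlier in the thread):

```shell
# print CrawlDB statistics, including counts per status
# (db_fetched, db_unfetched, db_gone, db_redir_temp, db_redir_perm)
bin/nutch readdb crawl/crawldb -stats
```

This is the quickest way to watch how the fetched/unfetched ratio changes between crawl rounds.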
Curious...I have been using Nutch for a while now and have never tried to index
any audio or video formats. Is it feasible to grab the audio out of both forms
of media and then index it? I believe this would require some kind of
transcription which may be out of reach on this project.
Thanks,