Hi all,

When I run a Nutch 1.0 crawl on Hadoop 0.19.1 (configured through nutch-site.xml), I hit the following problem:

2009-06-05 06:46:31,012 WARN  crawl.Generator - Generator: 0 records
selected for fetching, exiting ...
2009-06-05 06:46:31,028 INFO  crawl.Crawl - Stopping at depth=0 - no more
URLs to fetch.
2009-06-05 06:46:31,028 WARN  crawl.Crawl - No URLs to fetch - check your
seed list and URL filters.
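
For reference, the way I launch the crawl and the properties I set look roughly like this; the host names, ports, and depth/topN values below are just placeholders for my actual setup:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 50

and in conf/nutch-site.xml:

  <configuration>
    <!-- point Nutch's bundled Hadoop at the running DFS and JobTracker -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:9001</value>
    </property>
  </configuration>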

I run "bin/hadoop -put urls urls" to dfs and "bin/hadoop -get urls ." to
check that in urls directory the seed.txt does exist and not blank. and i
also set the crawl-urlfilter.txt to let the "my.domain.com" changed.
when i set nutch-site.config to let it run locally instead of hadoop, it
works. however when runing on hadoop, it comes to "NO URL to fetch".
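
Concretely, the accept rule in conf/crawl-urlfilter.txt now looks roughly like this ("my.domain.com" stands in for my real domain); the rest of the default filter rules I left as they ship with Nutch 1.0:

  # accept hosts in MY.DOMAIN.NAME
  +^http://([a-z0-9]*\.)*my.domain.com/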

While searching for the cause, I found that Nutch 0.9 used to have a bug that could produce this symptom, but when I checked the patch file I found that the fix is already included in Nutch 1.0.

I am very confused and am looking forward to your help.

Thank you very much.


-- 
Yours Sincerely
Xudong Du
Zijing 2# 305A
Tsinghua University, Beijing, China, 100084
