Hi,
I’m running Nutch 2 on a 2-node Hadoop cluster. I’m also running Solr 4 on a
separate machine, reachable over a private IP. I run the crawl command as
follows:
bin/crawl urls/seed.txt TestCrawl <solrUrl> 2
My problem is that no URLs are fetched, and consequently nothing is indexed.
When I run the stats job, this is the output:
{db_stats-job_201405261214_0043=
  {jobID=job_201405261214_0043,
   jobName=db_stats,
   counters=
     {File Input Format Counters={BYTES_READ=0},
      Job Counters={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=7990,
        FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
        TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=9980},
      Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6,
        MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0,
        MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=218103808,
        CPU_MILLISECONDS=1950, SPLIT_RAW_BYTES=1017,
        COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
        REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
        PHYSICAL_MEMORY_BYTES=296411136, REDUCE_OUTPUT_RECORDS=0,
        VIRTUAL_MEMORY_BYTES=2251104256, MAP_OUTPUT_RECORDS=0},
      FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017,
        FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86},
      File Output Format Counters={BYTES_WRITTEN=86}}}}
14/05/26 23:12:34 INFO crawl.WebTableReader: TOTAL urls: 0
14/05/26 23:12:34 INFO crawl.WebTableReader: WebTable statistics: done
What am I missing? My regex and normalise filters allow all URL patterns, and
I'm attempting a whole-web crawl.
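For reference, here is how I sanity-checked that my filters really accept everything. It's a rough sh stand-in (not Nutch code) for what I understand to be the first-matching-rule semantics of regex-urlfilter.txt; the helper name and rule strings are made up for illustration:

```shell
#!/bin/sh
# Stand-in for regex-urlfilter evaluation (assumption: the first
# matching +/- prefixed rule decides; no match means reject).
filter_url() {
  url=$1; shift
  for rule in "$@"; do
    pattern=${rule#?}                 # rule minus its +/- prefix
    if printf '%s' "$url" | grep -qE "$pattern"; then
      case $rule in
        +*) echo accepted ;;         # first match was a + rule
        *)  echo rejected ;;         # first match was a - rule
      esac
      return
    fi
  done
  echo rejected                      # nothing matched
}

# An accept-everything config (a lone "+." rule) accepts any URL:
filter_url "http://example.com/" "+."   # prints "accepted"
```

With my actual config reduced to a single `+.` rule, every seed URL comes back accepted, so I don't think the filters are the problem.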
--
Manikandan Saravanan
Architect - Technology
TheSocialPeople