Hi,

I’m running Nutch 2 on a 2-node Hadoop cluster. I’m also running Solr 4 on a
separate machine reachable via its private IP. I start the crawl with the
following command:

bin/crawl urls/seed.txt TestCrawl <solrUrl> 2
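For reference, urls/seed.txt is just a plain list of seed URLs, one per line (the URL below is only a placeholder showing the format, not my actual seed):

```
# urls/seed.txt — one fully qualified URL per line (placeholder example)
http://example.com/
```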

My problem is that no URLs are fetched, and consequently nothing is indexed.
When I run the stats command, this is the output:

{db_stats-job_201405261214_0043=
        {jobID=job_201405261214_0043,
         jobName=db_stats,
         counters=
                {File Input Format Counters={BYTES_READ=0},
                 Job Counters={TOTAL_LAUNCHED_REDUCES=1, SLOTS_MILLIS_MAPS=7990,
                  FALLOW_SLOTS_MILLIS_REDUCES=0, FALLOW_SLOTS_MILLIS_MAPS=0,
                  TOTAL_LAUNCHED_MAPS=1, SLOTS_MILLIS_REDUCES=9980},
                 Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6,
                  MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=6, SPILLED_RECORDS=0,
                  MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=218103808,
                  CPU_MILLISECONDS=1950, SPLIT_RAW_BYTES=1017,
                  COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0,
                  REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0,
                  PHYSICAL_MEMORY_BYTES=296411136, REDUCE_OUTPUT_RECORDS=0,
                  VIRTUAL_MEMORY_BYTES=2251104256, MAP_OUTPUT_RECORDS=0},
                 FileSystemCounters={FILE_BYTES_READ=6, HDFS_BYTES_READ=1017,
                  FILE_BYTES_WRITTEN=156962, HDFS_BYTES_WRITTEN=86},
                 File Output Format Counters={BYTES_WRITTEN=86}}}}
14/05/26 23:12:34 INFO crawl.WebTableReader: TOTAL urls:        0
14/05/26 23:12:34 INFO crawl.WebTableReader: WebTable statistics: done

What am I missing? My regex and normalise filters are allowing all URL
patterns, and I’m trying to do a whole-web crawl.
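To show what I mean by "allowing all URL patterns": my regex-urlfilter.txt is effectively the stock Nutch config, ending with the catch-all accept rule (this is a sketch of the relevant lines, not my exact file):

```
# regex-urlfilter.txt (sketch of the relevant rules, not my exact file)
# skip file:, ftp:, and mailto: URLs, as in the stock config
-^(file|ftp|mailto):
# accept anything else
+.
```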

-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople
