I am using 0.9 too.

I am now getting further, but I get a bunch of NullPointerExceptions:

fetch of http://www.yahoo.com/s/557760 failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/r/hq failed with: java.lang.NullPointerException
fetch of http://www.yahoo.com/s/557762 failed with: java.lang.NullPointerException
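
To rule out plain connectivity problems before digging into the fetcher, those URLs can be requested with the bare JDK. A minimal standalone sketch (the class name FetchCheck is only illustrative, nothing Nutch-specific):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchCheck {
        public static void main(String[] args) throws Exception {
            // The URLs from the failed-fetch messages above
            String[] urls = {
                "http://www.yahoo.com/s/557760",
                "http://www.yahoo.com/r/hq",
                "http://www.yahoo.com/s/557762"
            };
            for (String u : urls) {
                HttpURLConnection conn =
                    (HttpURLConnection) new URL(u).openConnection();
                conn.setConnectTimeout(10000);
                conn.setReadTimeout(10000);
                // A 2xx/3xx status here means the server answers fine and the
                // NullPointerException is happening inside Nutch itself
                System.out.println(u + " -> " + conn.getResponseCode()
                        + " " + conn.getContentType());
                conn.disconnect();
            }
        }
    }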


On 4/15/07, songjue <[EMAIL PROTECTED]> wrote:
I tried this, and Nutch 0.9 works just fine. What's your Nutch version?



songjue
2007-04-16



From: Meryl Silverburgh
Sent: 2007-04-16 11:33:05
To: [EMAIL PROTECTED]
Cc:
Subject: Crawl www.yahoo.com with nutch

I set up Nutch to crawl; in my input file I have only one site:
"http://www.yahoo.com"

$ bin/nutch crawl urls -dir crawl -depth 3

and I have added 'yahoo.com' as my domain name in crawl-urlfilter.txt

# accept hosts in MY.DOMAIN.NAME
+^http://([a-zA-Z0-9]*\.)*(cnn\.com|yahoo\.com)/
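
If you want to sanity-check that rule outside Nutch, the expression can be fed straight to java.util.regex. This is just a standalone sketch, not a Nutch tool; find() is used on the assumption that the regex urlfilter matches anywhere in the URL, anchored here by the leading ^:

    import java.util.regex.Pattern;

    public class FilterRuleCheck {
        public static void main(String[] args) {
            // Same expression as the crawl-urlfilter.txt rule, minus the leading '+'
            Pattern rule = Pattern.compile(
                "^http://([a-zA-Z0-9]*\\.)*(cnn\\.com|yahoo\\.com)/");
            String[] urls = {
                "http://www.yahoo.com/",
                "http://www.cnn.com/",
                "http://www.yahoo.com/s/557760"
            };
            for (String u : urls) {
                System.out.println(u + " accepted: " + rule.matcher(u).find());
            }
        }
    }

All three URLs above should print "accepted: true"; if a URL you expect to be crawled prints false, the filter is what's dropping it.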

But no links are being fetched. When I change the link to www.cnn.com,
it works. Can you please tell me what I need to do to make
www.yahoo.com work?

$ bin/nutch crawl urls -dir crawl -depth 3
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070415222440
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070415222440
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070415222440]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070415222449
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070415222440
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070415222440
Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
