I set up Nutch to crawl; my input file contains only one site, "http://www.yahoo.com", and I run:

$ bin/nutch crawl urls -dir crawl -depth 3

I have also added yahoo.com as my domain name in crawl-urlfilter.txt:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-zA-Z0-9]*\.)*(cnn.com|yahoo.com)/

But no links are being fetched. When I change the seed URL to www.cnn.com, it works. Can you please tell me what I need to change to make www.yahoo.com work? Here is the full output:

$ bin/nutch crawl urls -dir crawl -depth 3
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070415222440
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20070415222440
Fetcher: threads: 10
fetching http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20070415222440]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20070415222449
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20070415222440
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20070415222440
Indexing [http://www.yahoo.com/] with analyzer [EMAIL PROTECTED] (null)
Optimizing index.
merging segments _ram_0 (1 docs) into _0 (1 docs)
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Dedup: done
merging indexes to: crawl/index
Adding crawl/indexes/part-00000
done merging
crawl finished: crawl
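One thing I noticed while pasting this: the dots inside (cnn.com|yahoo.com) are unescaped, so '.' matches any character there. I doubt that is what blocks the crawl, since it makes the filter more permissive rather than less, but the strictly correct line would be:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-zA-Z0-9]*\.)*(cnn\.com|yahoo\.com)/

Also worth noting: the stock conf/crawl-urlfilter.txt ships with a rule like -[?*!@=] that drops any URL containing '?', '*', '!', '@' or '=' as a probable query. Since the CrawlDb update above runs with "URL filtering: true", outlinks from www.yahoo.com that carry query strings would be silently discarded, which might explain the "0 records selected for fetching" at depth 2.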

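If it helps with debugging, the filter chain can be exercised directly against a URL. This assumes my Nutch build ships org.apache.nutch.net.URLFilterChecker; the class and its -allCombined flag may differ between versions:

# prints the URL prefixed with '+' if the combined filter chain
# accepts it, or with '-' if some filter rejects it
$ echo "http://www.yahoo.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

Running yahoo.com outlinks through this would show whether the filters, rather than the fetcher, are what is eating them.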