Hi,

I am new to Nutch. Please assist!
I am using Cygwin (Windows XP) and Nutch 0.8.1.

In crawl-urlfilter.txt, I commented out the default line and added one for cnn.com:
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*cnn.com/
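
To sanity-check the filter line, here is a rough sketch that tests a few URLs against the same pattern (without the leading '+'). It assumes that grep -E treats this simple pattern the same way as the Java regex engine Nutch uses, which I have not verified beyond these cases; check_url is just a helper name I made up.

```shell
# Pattern copied from the crawl-urlfilter.txt line, minus the leading '+'.
pattern='^http://([a-z0-9]*\.)*cnn.com/'

# Hypothetical helper: prints "accept" if the URL matches the pattern, else "reject".
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://www.cnn.com/'    # accept
check_url 'http://www.cnn.com'     # reject: the pattern requires the trailing slash
check_url 'http://example.org/'    # reject
check_url 'http://www.cnnxcom/'    # accept: the unescaped '.' in cnn.com matches any character
```

The log below shows the seed being fetched as http://www.cnn.com/ (with the trailing slash), so the missing slash in the seed does not seem to be what breaks the crawl.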

$ mkdir urls
$ echo 'http://www.cnn.com' > urls/seeds.txt
$ nutch crawl urls -dir db -depth 1 -topN 10

I got the following error:
[EMAIL PROTECTED] /cygdrive/d/corpus/data
$ nutch crawl urls -dir db -depth 1 -threads 1 -topN 10
crawl started in: db
rootUrlDir = urls
threads = 1
depth = 1
topN = 10
Injector: starting
Injector: crawlDb: db/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: db/segments/20061026061130
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: db/segments/20061026061130
Fetcher: threads: 1
fetching http://www.cnn.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: db/crawldb
CrawlDb update: segment: db/segments/20061026061130
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: db/linkdb
LinkDb: adding segment: db/segments/20061026061130
LinkDb: done
Indexer: starting
Indexer: linkdb: db/linkdb
Indexer: adding segment: db/segments/20061026061130
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
       at org.apache.nutch.indexer.Indexer.index(Indexer.java:296)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:121)
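
The stack trace only says "Job failed!", so the underlying cause is presumably recorded elsewhere. Below is a minimal sketch for pulling the most recent error lines out of a log file. The helper name recent_errors and the patterns 'Exception' and 'ERROR' are my own assumptions, and so is the log location mentioned afterwards.

```shell
# Hypothetical helper: print the last 20 lines of a log file that mention
# an exception or an error. The log path is passed in by the caller.
recent_errors() {
  log_file="$1"
  grep -E 'Exception|ERROR' "$log_file" | tail -n 20
}
```

It could be run as recent_errors logs/hadoop.log from the Nutch install directory, assuming that is where this version writes its log.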

Help!!!


Regards,
Haward

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
