I am new to Nutch and still trying to figure out the code flow. However, as a workaround for issue #1, after the crawl finishes you could run the linkdb and index commands separately from Cygwin:
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but just in case.

-Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius <mgris...@comcast.net> wrote:
> Two observations using the nutch 1.1. nightly build
> nutch-2010-04-14_04-00-47:
>
> 1) Previously I was using nutch 1.0 to crawl successfully, but had
> problems w/ parse-pdf. I decided to try nutch 1.1. w/ parse-tika, which
> appears to parse all of the 'problem' pdfs that parse-pdf could not
> handle. The crawldb and segments directories are created and appear to
> be valid. However, the overall crawl does not finish now:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20100415015102]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
>
> Nutch 1.0 would complete like this:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=7 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
> LinkDb: done
> Indexer: starting
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
> done merging
> crawl finished: crawl
>
> Any ideas?
>
>
> 2) if there is a 'space' in any component dir then $NUTCH_OPTS is
> invalid and causes this problem:
>
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
> crawl urls/urls -dir crawl -depth 10 -topN 10
> NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
> Exception in thread "main" java.lang.NoClassDefFoundError:
> folder/nutch-2010-04-14_04-00-47/logs
> Caused by: java.lang.ClassNotFoundException:
> folder.nutch-2010-04-14_04-00-47.logs
> at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
> at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
> Program will exit.
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>
>
> Obviously the work around is to rename 'untitled folder' to
> 'untitledFolderWithNoSpaces'
>
> Thanks, any help w/b appreciated w/ issue #1 above.
>
> -m.
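
On issue #2: the stack trace is consistent with the launcher script expanding $NUTCH_OPTS without quotes, so that the shell splits the option at the space and java takes the second half as the main class name. I have not checked bin/nutch to confirm this, so treat the following as a minimal sketch of the effect rather than the actual script. The names log_dir, opts, and count_args are made up for illustration, and the path is shortened from the one in the report.

```shell
#!/bin/sh
# Hypothetical path containing a space, modelled on the one in the report.
log_dir="/home/mag/Desktop/untitled folder/logs"
opts="-Dhadoop.log.dir=$log_dir"

# Helper (illustrative only): print how many arguments it received.
count_args() { echo "$#"; }

# Unquoted expansion: the shell splits the value at the space, so the
# callee sees "-Dhadoop.log.dir=/home/mag/Desktop/untitled" and "folder/logs".
unquoted=$(count_args $opts)

# Quoted expansion: the value is passed through as a single argument.
quoted=$(count_args "$opts")

echo "unquoted: $unquoted, quoted: $quoted"
```

Running it prints "unquoted: 2, quoted: 1". If the script does build its java command line this way, quoting the expansion would be the place to look, but renaming the directory remains the simplest workaround.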