I am new to Nutch and still trying to figure out the code flow. However, as
a workaround for issue #1, after the crawl finishes you could run the
invertlinks (linkdb) and index commands separately from Cygwin:

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments

$ bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb \
    crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but
I am mentioning it just in case.
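
If you would rather not type each segment directory by hand, something like
the following might help. It is only a rough sketch: the function name
finish_crawl and the CRAWL_DIR and NUTCH variables are my own placeholders,
and it assumes the crawl/ layout from the commands above (crawldb, linkdb,
indexes and segments all under one directory).

```shell
#!/bin/sh
# Sketch only: finish a crawl by hand after the Crawl class has exited early.
# CRAWL_DIR and NUTCH are assumed names; adjust them to your own layout.
CRAWL_DIR=${CRAWL_DIR:-crawl}
NUTCH=${NUTCH:-bin/nutch}

finish_crawl() {
    # Step 1: build the link database from every segment under segments/.
    "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1

    # Step 2: collect the segment directories as separate arguments.
    # Quoting "$seg" keeps a path that contains spaces in one piece.
    set --
    for seg in "$CRAWL_DIR"/segments/*; do
        [ -d "$seg" ] && set -- "$@" "$seg"
    done

    # Step 3: index all collected segments in a single run.
    "$NUTCH" index "$CRAWL_DIR/indexes" "$CRAWL_DIR/crawldb" \
        "$CRAWL_DIR/linkdb" "$@"
}
```

Run it from the Nutch install directory by sourcing the file and calling
finish_crawl, or set CRAWL_DIR and NUTCH first if your paths differ.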

-Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius <mgris...@comcast.net> wrote:

> Two observations using the nutch 1.1. nightly build
> nutch-2010-04-14_04-00-47:
>
> 1) Previously I was using nutch 1.0 to crawl successfully, but had
> problems w/ parse-pdf. I decided to try nutch 1.1. w/ parse-tika, which
> appears to parse all of the 'problem' pdfs that parse-pdf could not
> handle. The crawldb and segments directories are created and appear to
> be valid. However, the overall crawl does not finish now:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20100415015102]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
>
> Nutch 1.0 would complete like this:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=7 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
> LinkDb: done
> Indexer: starting
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-00000
> done merging
> crawl finished: crawl
>
> Any ideas?
>
>
> 2) if there is a 'space' in any component dir then $NUTCH_OPTS is
> invalid and causes this problem:
>
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
> crawl urls/urls -dir crawl -depth 10 -topN 10
> NUTCH_OPTS:  -Dhadoop.log.dir=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
> Exception in thread "main" java.lang.NoClassDefFoundError:
> folder/nutch-2010-04-14_04-00-47/logs
> Caused by: java.lang.ClassNotFoundException:
> folder.nutch-2010-04-14_04-00-47.logs
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
> Program will exit.
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>
>
> Obviously the work around is to rename 'untitled folder' to
> 'untitledFolderWithNoSpaces'
>
> Thanks, any help w/b appreciated w/ issue #1 above.
>
> -m.
>
>
>
