The fix:
At line 131 of Crawl.java, the generate method no longer returns segments as it used to; it now returns segs.
Line 131 needs to read
if (segs == null)
instead of the current
if (segments == null)
After that change and a recompile, crawl is working just fine.
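A minimal sketch of the pattern behind the fix (the generate stand-in and surrounding logic are assumptions for illustration, not the actual Nutch source): when no records are selected for fetching, generate returns null, and the renamed variable must be the one that gets null-checked before use.

```java
import java.nio.file.Path;

public class CrawlSketch {
    // Stand-in for Generator.generate(), which can return null when
    // 0 records are selected for fetching (as in the log quoted below).
    static Path generate(boolean anythingToFetch) {
        return anythingToFetch ? Path.of("crawl/segments/20100415221103") : null;
    }

    public static void main(String[] args) {
        Path segs = generate(false);   // variable renamed from 'segments' to 'segs'
        if (segs == null) {            // the corrected check at line 131
            System.out.println("Generator returned no segments, exiting.");
            return;                    // exits cleanly instead of throwing an NPE
        }
        System.out.println("Fetching " + segs);
    }
}
```

Checking the variable that was actually assigned (segs, not the stale segments) is what prevents the NullPointerException in Crawl.main quoted in the thread below.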
Two observations using the nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:
1) I was using nutch 1.0 to crawl successfully, but had problems with
parse-pdf. I decided to try nutch 1.1 with parse-tika, which appears to
parse all of the 'problem' pdfs that parse-pdf could not handle. The
crawldb and
On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote:
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.crawl.Crawl.main(Cr
Hi Harry,
Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment:
I am new to nutch and still trying to figure out the code flow; however, as
a workaround to issue #1, after the crawl finishes you could run the
invertlinks and index commands separately from cygwin.
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ cr