On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote:
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.crawl.Crawl.main(Cr
Hi Harry,
Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment:
I am new to Nutch and still trying to figure out the code flow; however, as
a workaround for issue #1, after the crawl finishes you could run the linkdb
and index commands separately from Cygwin.
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ cr
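For reference, a sketch of the full manual post-crawl sequence on Nutch 1.x. The crawl/linkdb, crawl/segments, and crawl/indexes paths are assumed from the commands above, and segment directory names will differ per run:

```shell
# Sketch of the manual post-crawl steps (Nutch 1.x); paths are assumptions.
# 1) Invert links so the indexer can pick up anchor text.
bin/nutch invertlinks crawl/linkdb -dir crawl/segments

# 2) Build the index. Indexer usage: index <index> <crawldb> <linkdb> <segment> ...
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
```

Running the indexer over all segments at once avoids indexing each fetch round separately.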
Two observations using the nutch 1.1 nightly build
nutch-2010-04-14_04-00-47:
1) Previously I was using nutch 1.0 to crawl successfully, but had
problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which
appears to parse all of the 'problem' PDFs that parse-pdf could not
handle. The
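Switching from parse-pdf to parse-tika is typically done via the plugin.includes property in conf/nutch-site.xml. A sketch, with an assumed plugin list (the exact regex depends on which other plugins your crawl needs):

```xml
<!-- conf/nutch-site.xml: enable parse-tika instead of parse-pdf (plugin list is an assumption) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)</value>
</property>
```

Settings in nutch-site.xml override the defaults shipped in nutch-default.xml, so only this one property needs to change.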
I have an old page on my site that Nutch is fetching. The results in the
Nutch web app look like this:
Site Map
... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by
Scientific Name ... Common Name Select
NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugs