Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Phil Barnett
On Thu, 2010-04-15 at 15:34 -0400, matthew a. grisius wrote:
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.nutch.crawl.Crawl.main(Cr

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Hi Harry, Yes indeed. It appears to work for me too. Thank you!
nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/crawl/segments/20100415221103
LinkDb: adding segment:

Re: nutch 1.1 crawl d/n complete issue

2010-04-15 Thread Harry Nutch
I am new to Nutch and still trying to figure out the code flow. However, as a workaround for issue #1, after the crawl finishes you could run the linkdb and index commands separately from Cygwin:
$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ cr

nutch 1.1 crawl d/n complete issue

2010-04-15 Thread matthew a. grisius
Two observations using the Nutch 1.1 nightly build nutch-2010-04-14_04-00-47: 1) Previously I was using Nutch 1.0 to crawl successfully, but had problems w/ parse-pdf. I decided to try Nutch 1.1 w/ parse-tika, which appears to parse all of the 'problem' PDFs that parse-pdf could not handle. The

Weird crawl issue. Nutch picking up drop-down menu options.

2010-04-15 Thread tsmori
I have an old page on my site that Nutch is fetching. The results in the Nutch web app look like this: Site Map ... INSECT SYSTEMATIC RESOURCES Home : Site Map search Resources by Scientific Name ... Common Name Select NameAlderfliesAntsAntlionsAphidsBarkliceBeesBeetlesBookliceBristletailsBugs