Hi

I'm using Nutch 2.2.1

Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
entire webtable to get started: http://wiki.apache.org/nutch/Nutch2Crawling

This is a serious bottleneck for my use case.

I know that the fetch and parse jobs can be combined via the Nutch config.
This removes the need to run the parse job separately, so the webtable
does not need to be read again for parsing.
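
For reference, the setting I mean is, as far as I can tell, the
fetcher.parse property. A minimal sketch of the override, assuming it goes
in conf/nutch-site.xml and that the property name matches the one in
nutch-default.xml for 2.2.1:

```xml
<!-- conf/nutch-site.xml (assumed location for local overrides) -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- true: the fetcher parses content as it fetches,
         so no separate parse job is needed -->
    <value>true</value>
  </property>
</configuration>
```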

The page I linked to states that a future development might be combining
the generate and fetch stages so that only one read of the webtable is
required.

Has anyone attempted to do this? Is there a patch out there for a combined
generator and fetch job?

Thanks

Az
