Hi Azhar,

Can you explain why that is a bottleneck for you?

Thanks
On 25 May 2014 17:51, "Azhar Jassal" <[email protected]> wrote:

> Hi
>
> I'm using Nutch 2.2.1
>
> Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
> entire webtable to get started:
> http://wiki.apache.org/nutch/Nutch2Crawling
>
> This is a serious bottleneck for my use case.
>
> I know that the fetch and parse job can be combined via the Nutch config.
> This removes the need for the parse job to be run separately, and therefore
> the webtable does not need to be read again.
>
> The page I linked to states that a future development might be combining
> the generate and fetch stages so that only one read of the webtable is
> required.
>
> Has anyone attempted to do this? Is there a patch out there for a combined
> generator and fetch job?
>
> Thanks
>
> Az
>
