Hi Azhar,

Can you explain why that is a bottleneck for you?
Thanks

On 25 May 2014 at 17:51, "Azhar Jassal" <[email protected]> wrote:

> Hi
>
> I'm using Nutch 2.2.1
>
> Each of the 4 jobs in the crawl cycle, as explained here, needs to reread the
> entire webtable to get started:
> http://wiki.apache.org/nutch/Nutch2Crawling
>
> This is a serious bottleneck for my use case.
>
> I know that the fetch and parse jobs can be combined via the Nutch config.
> This removes the need for the parse job to be run separately, and therefore
> the webtable does not need to be read again.
>
> The page I linked to states that a future development might be combining
> the generate and fetch stages so that only one read of the webtable is
> required.
>
> Has anyone attempted to do this? Is there a patch out there for a combined
> generator and fetch job?
>
> Thanks
>
> Az

