Yeah, impressive. The defaults are not really optimal for production crawls, where it's unlikely that URL filter / normalization rules change between the steps of a running crawl.
Ideally, URLs should be filtered / normalized only when new URLs are added to the CrawlDb:
- seeds
- outlinks
- redirects

But there may be other opinions with better arguments? Are there any?

Regarding the outlinks, I'm not sure whether it's better to do normalization and filtering during the parse job or when updating the CrawlDb.

Feel free to continue the discussion or open a Jira to improve the default configuration.

Thanks,
Sebastian

On 02/05/2015 06:24 PM, Iain Lopata wrote:
> Reduced processing time from 40 minutes down to 30 seconds! Thank you!
>
> -----Original Message-----
> From: Iain Lopata [mailto:ilopa...@hotmail.com]
> Sent: Monday, February 2, 2015 11:36 AM
> To: user@nutch.apache.org
> Subject: RE: InvertLinks Performance Nutch 1.6
>
> Thanks Sebastian -- I had not turned off filtering/normalization and did not
> appreciate that it could be a significant contribution. I will give that a try.
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Monday, February 2, 2015 11:32 AM
> To: user@nutch.apache.org
> Subject: Re: InvertLinks Performance Nutch 1.6
>
> Hi Iain,
>
> Is the link inversion done with URL normalization/filtering?
> That could take a long time if there are many links, possibly in
> combination with complex filters or long URLs (which make the regex
> filter slow).
>
> Filtering/normalization is on by default.
> You have to disable it explicitly via:
>   % nutch invertlinks ... -noNormalize -noFilter
>
> Best,
> Sebastian
>
> 2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:
>
>> I am running the invertlinks step in my Nutch 1.6 based crawl process
>> on a single node. I run invertlinks only because I need the Inlinks
>> in the indexer step so as to store them with the document. I do not
>> need the anchor text and I am not scoring.
>> I am finding that invertlinks (and more specifically the merge of the
>> linkdb) takes a long time -- about 30 minutes for a crawl of around
>> 150K documents. I am looking for ways that I might shorten this
>> processing time. Any suggestions?
>>
>> I actually only need the Inlinks for a subset of my documents, which
>> could be identified either by a URL regex pattern match or by MIME
>> type. This would be a case where a scoped filter for the invertlinks
>> step might be helpful, but I understand that scoping is only available
>> for normalizers.
>>
>> Thanks
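[Editor's note: for readers landing on this thread, here is a minimal sketch of the fix discussed above. Only the -noNormalize / -noFilter flags come from the thread itself; the crawl/linkdb and crawl/segments paths are placeholders for your own crawl directory layout. The script only assembles and prints the command line so the flags are visible in one place.]

```shell
# Sketch: skip URL normalization and filtering during link inversion
# (both are on by default, as Sebastian notes above).
# LINKDB and SEGMENTS are placeholder paths, not Nutch defaults.
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments
CMD="bin/nutch invertlinks $LINKDB -dir $SEGMENTS -noNormalize -noFilter"
echo "$CMD"
```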