Yeah, impressive.

The defaults are not optimal for production crawls,
where it's unlikely that URL filter / normalization
rules change between the steps of a running crawl.

Ideally, URLs should be filtered / normalized
only when new URLs are added to the CrawlDb:
- seeds
- outlinks
- redirects
But there may be other opinions with better
arguments? Are there any? One possible crawl
cycle along these lines is sketched below.
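
Just to illustrate, a minimal sketch of one cycle with the
Nutch 1.x command-line tools (the paths and $SEGMENT are
placeholders, and the exact flag names should be checked
against the usage output of each command in your version):

  # inject: seeds are added to the CrawlDb, so keep the
  # Injector's default filtering/normalization
  % nutch inject crawl/crawldb urls/

  # generate: URLs come out of the CrawlDb and were already checked
  % nutch generate crawl/crawldb crawl/segments -topN 50000 -noFilter -noNorm

  # fetch and parse the new segment
  % nutch fetch $SEGMENT
  % nutch parse $SEGMENT

  # updatedb: outlinks and redirects enter the CrawlDb here,
  # so enable filtering/normalization explicitly
  % nutch updatedb crawl/crawldb $SEGMENT -normalize -filter

  # invertlinks: skip the expensive per-URL re-check, as discussed
  % nutch invertlinks crawl/linkdb -dir crawl/segments -noNormalize -noFilter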

Regarding the outlinks, I'm not sure whether
it's better to do normalization and filtering
during the parse job or when updating the CrawlDb.
Both variants are sketched below.
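
For comparison, a rough sketch of the two variants. Caveat:
the property names parse.normalize.urls / parse.filter.urls
are the ones documented in nutch-default.xml, and passing
them via -D assumes the tool runs through Hadoop's ToolRunner;
please verify both for your version:

  # (a) normalize/filter outlinks during parse, skip the pass in updatedb
  % nutch parse -D parse.normalize.urls=true -D parse.filter.urls=true $SEGMENT
  % nutch updatedb crawl/crawldb $SEGMENT

  # (b) keep the parse job cheap, do the work once when updating the CrawlDb
  % nutch parse -D parse.normalize.urls=false -D parse.filter.urls=false $SEGMENT
  % nutch updatedb crawl/crawldb $SEGMENT -normalize -filter

With (a) the cleaned outlinks are stored in the segment, so
later jobs (invertlinks, indexing) profit as well; with (b)
the segment keeps the raw outlinks and only the CrawlDb is
guaranteed to be clean.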

Feel free to continue the discussion or open
a Jira to improve the default configuration.

Thanks,
Sebastian


On 02/05/2015 06:24 PM, Iain Lopata wrote:
> Reduced processing time from 40 minutes down to 30 seconds!  Thank you!
> 
> -----Original Message-----
> From: Iain Lopata [mailto:ilopa...@hotmail.com] 
> Sent: Monday, February 2, 2015 11:36 AM
> To: user@nutch.apache.org
> Subject: RE: InvertLinks Performance Nutch 1.6
> 
> Thanks Sebastian -- I had not turned off filtering/normalization and did not 
> appreciate that they could be such a significant contributor.  I will give that a try.
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Monday, February 2, 2015 11:32 AM
> To: user@nutch.apache.org
> Subject: Re: InvertLinks Performance Nutch 1.6
> 
> Hi Iain,
> 
> Is the link inversion done with URL normalization/filtering?
> That can take a long time if there are many links, especially in 
> combination with complex filters or long URLs (which make the regex filter 
> slow).
> 
> Filtering/normalization is on by default.
> You have to disable it explicitly via:
> % nutch invertlinks ... -noNormalize -noFilter
> 
> Best,
> Sebastian
> 
> 
> 
> 2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:
> 
>> I am running the invertlinks step in my Nutch 1.6 based crawl process 
>> on a single node.  I run invertlinks only because I need the Inlinks 
>> in the indexer step so as to store them with the document.  I do not 
>> need the anchor text and I am not scoring.  I am finding that 
>> invertlinks (and more specifically the merge of the linkdb) takes a 
>> long time - about 30 minutes for a crawl of around 150K documents.  I 
>> am looking for ways that I might shorten this processing time.  Any 
>> suggestions?
>>
>>
>>
>> I actually only need the Inlinks for a subset of my documents, which 
>> could be identified either by a URL regex pattern match or by MIME 
>> type.  This would be a case where a scoped filter for the invertlinks 
>> step might be helpful, but I understand that scoping is only available for 
>> normalizers.
>>
>>
>>
>>
>>
>> Thanks
>>
>>
> 
> 
