RE: InvertLinks Performance Nutch 1.6

Iain Lopata Mon, 02 Feb 2015 09:37:55 -0800

Thanks Sebastian -- I had not turned off filtering/normalization and did not 
appreciate that they could be a significant contributor.  I will give that a try.


-----Original Message-----
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Monday, February 2, 2015 11:32 AM
To: user@nutch.apache.org
Subject: Re: InvertLinks Performance Nutch 1.6

Hi Iain,

Is the link inversion done with URL normalization/filtering?
That could take a long time if there are many links, particularly in 
combination with complex filters or long URLs (which make the regex filter 
slow).

Filtering/normalization is on by default.
You have to disable it explicitly via:
% nutch invertlinks ... -noNormalize -noFilter
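
A minimal sketch of assembling that command line from a script, assuming a
crawl layout with the linkdb at crawl/linkdb and the segments under
crawl/segments. Both paths, the helper name invertlinks_command, and the
bin/nutch launcher location are hypothetical. Only the -dir, -noNormalize and
-noFilter options come from the invertlinks usage.

```python
import shlex


def invertlinks_command(linkdb, segments_dir, normalize=False,
                        url_filter=False, nutch="bin/nutch"):
    """Build the argument list for `nutch invertlinks`.

    Normalization and filtering are on by default in Nutch, so each one
    has to be switched off with its own flag.
    """
    cmd = [nutch, "invertlinks", linkdb, "-dir", segments_dir]
    if not normalize:
        cmd.append("-noNormalize")   # skip URL normalizers
    if not url_filter:
        cmd.append("-noFilter")      # skip URL filters (e.g. the regex filter)
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; adjust to the actual crawl directory.
    print(shlex.join(invertlinks_command("crawl/linkdb", "crawl/segments")))
```

The list can be passed to subprocess.run as-is. Timing one run with the flags
and one without should show how much of the 30 minutes is spent in
filtering/normalization.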

Best,
Sebastian



2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:

> I am running the invertlinks step in my Nutch 1.6 based crawl process 
> on a single node.  I run invertlinks only because I need the Inlinks 
> in the indexer step so as to store them with the document.  I do not 
> need the anchor text and I am not scoring.  I am finding that 
> invertlinks (and more specifically the merge of the linkdb) takes a 
> long time - about 30 minutes for a crawl of around 150K documents.  I 
> am looking for ways that I might shorten this processing time.  Any 
> suggestions?
>
> I actually only need the Inlinks for a subset of my documents, which 
> could be identified either by a URL regex pattern match or by MIME 
> type.  This would be a case where a scoped filter for the invertlinks 
> step might be helpful, but I understand that scoping is only available for 
> normalizers.
>
> Thanks
