Re: InvertLinks Performance Nutch 1.6

Sebastian Nagel Mon, 02 Feb 2015 09:33:08 -0800

Hi Iain,

is the link inversion done with URL normalization/filtering?
That can take a long time if there are many links,
especially in combination with complex filters or long URLs
(which make the regex filter slow).


Filtering/normalization is on by default.
You have to disable it explicitly via:
% nutch invertlinks ... -noNormalize -noFilter
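
A minimal sketch of a full invocation, assuming the linkdb lives at
crawl/linkdb and the segments under crawl/segments (both paths are
placeholders, not taken from your setup). It only builds and prints
the command, so you can check it before running it:

```shell
# Placeholder paths: adjust to your own crawl layout.
NUTCH="bin/nutch"
LINKDB="crawl/linkdb"
SEGMENTS="crawl/segments"

# invertlinks with URL normalization and filtering switched off.
# -dir makes it read every segment found under $SEGMENTS.
CMD="$NUTCH invertlinks $LINKDB -dir $SEGMENTS -noNormalize -noFilter"

# Print the command for inspection; to run it: eval "$CMD"
echo "$CMD"
```

Timing one run with and one without the two flags should show
whether filtering/normalization is what costs the 30 minutes.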

Best,
Sebastian



2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:

> I am running the invertlinks step in my Nutch 1.6 based crawl process on a
> single node.  I run invertlinks only because I need the Inlinks in the
> indexer step so as to store them with the document.  I do not need the
> anchor text and I am not scoring.  I am finding that invertlinks (and more
> specifically the merge of the linkdb) takes a long time - about 30 minutes
> for a crawl of around 150K documents.  I am looking for ways that I might
> shorten this processing time.  Any suggestions?
>
> I actually only need the Inlinks for a subset of my documents, which could
> be identified either by a URL regex pattern match or by MIME type.  This
> would be a case where a scoped filter for the invertlinks step might be
> helpful, but I understand that scoping is only available for normalizers.
>
> Thanks
