Re: InvertLinks Performance Nutch 1.6

Mattmann, Chris A (3980) Thu, 05 Feb 2015 09:34:13 -0800

WOW friggin awesome

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Iain Lopata <ilopa...@hotmail.com>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Thursday, February 5, 2015 at 9:24 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: InvertLinks Performance Nutch 1.6

>Reduced processing time from 40 minutes down to 30 seconds!  Thank you!
>
>-----Original Message-----
>From: Iain Lopata [mailto:ilopa...@hotmail.com]
>Sent: Monday, February 2, 2015 11:36 AM
>To: user@nutch.apache.org
>Subject: RE: InvertLinks Performance Nutch 1.6
>
>Thanks Sebastian -- I had not turned off filtering/normalization and did
>not appreciate they could be a significant contributor.  I will give
>that a try.
>
>-----Original Message-----
>From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>Sent: Monday, February 2, 2015 11:32 AM
>To: user@nutch.apache.org
>Subject: Re: InvertLinks Performance Nutch 1.6
>
>Hi Iain,
>
>Is the link inversion done with URL normalization/filtering?
>That can take a long time if there are many links, probably in
>combination with complex filters or long URLs (which make the regex
>filter slow).
>
>Filtering/normalization is on by default.
>You have to disable it explicitly via:
>% nutch invertlinks ... -noNormalize -noFilter
>
>Best,
>Sebastian
>
>
>
>2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:
>
>> I am running the invertlinks step in my Nutch 1.6 based crawl process
>> on a single node.  I run invertlinks only because I need the Inlinks
>> in the indexer step so as to store them with the document.  I do not
>> need the anchor text and I am not scoring.  I am finding that
>> invertlinks (and more specifically the merge of the linkdb) takes a
>> long time - about 30 minutes for a crawl of around 150K documents.  I
>> am looking for ways that I might shorten this processing time.  Any
>> suggestions?
>>
>> I actually only need the Inlinks for a subset of my documents, which
>> could be identified either by a URL regex pattern match or by MIME
>> type.  This would be a case where a scoped filter for the invertlinks
>> step might be helpful, but I understand that scoping is only available
>> for normalizers.
>>
>> Thanks
>>
>>
>
>
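For reference, a minimal sketch of what a full invocation with both flags might look like. Only -noNormalize and -noFilter come from the messages above. The crawl/linkdb and crawl/segments paths and the NUTCH_BIN, LINKDB, SEGMENTS_DIR and RUN variables are hypothetical placeholders, and the -dir form assumes the usual Nutch 1.x invertlinks usage (a linkdb path followed by a segments directory). By default the script only prints the command; it runs it when RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own crawl layout.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
LINKDB="${LINKDB:-crawl/linkdb}"
SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"

# Build the invertlinks command with URL filtering and normalization
# switched off (both are on by default, per the thread).
CMD="$NUTCH_BIN invertlinks $LINKDB -dir $SEGMENTS_DIR -noNormalize -noFilter"

# Always show the command so it can be checked before running.
echo "$CMD"

# Execute only when explicitly requested, and report elapsed seconds.
if [ "${RUN:-0}" = "1" ]; then
    START=$(date +%s)
    $CMD
    STATUS=$?
    END=$(date +%s)
    echo "invertlinks exited with status $STATUS after $((END - START)) seconds"
fi
```

Running the same command once with and once without the two flags on a copy of the crawl data is a simple way to see how much of the time goes to filtering and normalization.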
