WOW friggin awesome

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Iain Lopata <ilopa...@hotmail.com>
Reply-To: "user@nutch.apache.org" <user@nutch.apache.org>
Date: Thursday, February 5, 2015 at 9:24 AM
To: "user@nutch.apache.org" <user@nutch.apache.org>
Subject: RE: InvertLinks Performance Nutch 1.6

>Reduced processing time from 40 minutes down to 30 seconds! Thank you!
>
>-----Original Message-----
>From: Iain Lopata [mailto:ilopa...@hotmail.com]
>Sent: Monday, February 2, 2015 11:36 AM
>To: user@nutch.apache.org
>Subject: RE: InvertLinks Performance Nutch 1.6
>
>Thanks Sebastian -- I had not turned off filtering/normalization and did
>not appreciate that they could be a significant contributor. I will give
>that a try.
>
>-----Original Message-----
>From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
>Sent: Monday, February 2, 2015 11:32 AM
>To: user@nutch.apache.org
>Subject: Re: InvertLinks Performance Nutch 1.6
>
>Hi Iain,
>
>Is the link inversion done with URL normalization/filtering?
>That could take a long time if there are many links, possibly in
>combination with complex filters or long URLs (which make the regex
>filter slow).
>
>Filtering/normalization is on by default.
>You have to disable it explicitly via:
>% nutch invertlinks ... -noNormalize -noFilter
>
>Best,
>Sebastian
>
>
>2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:
>
>> I am running the invertlinks step in my Nutch 1.6 based crawl process
>> on a single node. I run invertlinks only because I need the Inlinks
>> in the indexer step so as to store them with the document. I do not
>> need the anchor text and I am not scoring. I am finding that
>> invertlinks (and more specifically the merge of the linkdb) takes a
>> long time - about 30 minutes for a crawl of around 150K documents. I
>> am looking for ways that I might shorten this processing time. Any
>> suggestions?
>>
>> I actually only need the Inlinks for a subset of my documents, which
>> could be identified either by a URL regex pattern match or by MIME
>> type. This would be a case where a scoped filter for the invertlinks
>> step might be helpful, but I understand that scoping is only available
>> for normalizers.
>>
>> Thanks
>>
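
For anyone finding this thread later, here is a minimal sketch of the invocation Sebastian describes, written as a dry run that only prints the command. The paths crawl/linkdb and crawl/segments and the bin/nutch location are assumptions for illustration and do not come from the thread; the "-dir" form is the usual Nutch 1.x way of pointing invertlinks at a segments directory, so check the usage output of your own version before relying on it.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own crawl layout.
NUTCH_BIN="bin/nutch"
LINKDB="crawl/linkdb"
SEGMENTS_DIR="crawl/segments"

# Link inversion with URL normalization and filtering switched off,
# as suggested in the thread (both are on by default).
CMD="$NUTCH_BIN invertlinks $LINKDB -dir $SEGMENTS_DIR -noNormalize -noFilter"

# Dry run: print the command instead of executing it.
echo "$CMD"

# To actually run and time it, uncomment the next line:
# time $CMD
```

Skipping the filters at this step means any URL that made it into the segments also ends up in the linkdb, so this only makes sense if filtering and normalization were already applied earlier in the crawl cycle.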