Hi Iain,

is the link inversion done with URL normalization/filtering enabled? That can take a long time if there are many links, especially in combination with complex filters or long URLs (which make the regex filter slow).
Filtering/normalization is on by default. You have to disable it explicitly via:

% nutch invertlinks ... -noNormalize -noFilter

Best,
Sebastian

2015-01-29 23:20 GMT+01:00 Iain Lopata <ilopa...@hotmail.com>:

> I am running the invertlinks step in my Nutch 1.6 based crawl process on a
> single node. I run invertlinks only because I need the Inlinks in the
> indexer step so as to store them with the document. I do not need the
> anchor text and I am not scoring. I am finding that invertlinks (and more
> specifically the merge of the linkdb) takes a long time - about 30 minutes
> for a crawl of around 150K documents. I am looking for ways that I might
> shorten this processing time. Any suggestions?
>
> I actually only need the Inlinks for a subset of my documents, which could
> be identified either by a URL regex pattern match or by MIME type. This
> would be a case where a scoped filter for the invertlinks step might be
> helpful, but I understand that scoping is only available for normalizers.
>
> Thanks
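
For reference, a minimal sketch of an invertlinks run with normalization and filtering disabled. The paths crawl/linkdb and crawl/segments are hypothetical placeholders, and the argument order (linkdb first, then -dir with the segments directory) is assumed from the usual Nutch 1.x usage message, so check it against the usage output of your own installation. The script prints the command and only runs it if nutch is found on the PATH.

```shell
#!/bin/sh
# Hypothetical paths; adjust to your own crawl layout.
LINKDB="crawl/linkdb"
SEGMENTS="crawl/segments"

# invertlinks with URL normalization and filtering switched off.
CMD="nutch invertlinks $LINKDB -dir $SEGMENTS -noNormalize -noFilter"

# Show the command that would be executed.
echo "$CMD"

# Run it only when the nutch launcher is available.
# $CMD is left unquoted on purpose so it splits into separate arguments.
if command -v nutch >/dev/null 2>&1; then
  $CMD
fi
```

Comparing the wall-clock time of this run against a run without the two flags should show how much of the 30 minutes is spent in the filters and normalizers.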