[ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645802#comment-13645802 ]
Tejas Patil commented on NUTCH-649:
-----------------------------------

Hi [~lewismc], now that's an awesome idea... certainly better than lame debug statements. I will get that done soon.

> Log list of files found but not crawled.
> ----------------------------------------
>
>                 Key: NUTCH-649
>                 URL: https://issues.apache.org/jira/browse/NUTCH-649
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>         Environment: any
>            Reporter: Jim
>             Fix For: 1.7, 2.2
>
>         Attachments: NUTCH-649.2.x.patch, NUTCH-649.trunk.patch
>
>
> I use Nutch to find the location of executables on the web, but we do
> not download the executables with Nutch. In order to get Nutch to report the
> location of files without downloading them, I had to make a very small
> patch to the code, and I think this change might be useful to others as well.
> The patch just logs files that are being filtered at the info level, although
> perhaps it should be at the debug level.
> I have included an svn diff with this change. Use cases include both
> a diagnostic tool (let's see what we are skipping) and a way to
> find content and links pointed to by a page or site without having to
> actually download that content.
> Index: ParseOutputFormat.java
> ===================================================================
> --- ParseOutputFormat.java (revision 593619)
> +++ ParseOutputFormat.java (working copy)
> @@ -193,17 +193,20 @@
>            toHost = null;
>          }
>          if (toHost == null || !toHost.equals(fromHost)) { // external links
> +          LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
> +
>            continue; // skip it
>          }
>        }
>        try {
>          toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK); // normalize the url
> -        toUrl = filters.filter(toUrl); // filter the url
> -        if (toUrl == null) {
> -          continue;
> -        }
> -      } catch (Exception e) {
> +
> +        if (filters.filter(toUrl) == null) { // filter the url
> +          LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
> +          continue;
> +        }
> +      } catch (Exception e) {
>          continue;
>        }
>        CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);
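The pattern the patch introduces can be sketched in isolation: instead of silently dropping outlinks that the URL filters reject, surface each skipped URL together with the page that linked to it. This is a minimal, self-contained sketch; the class and method names (`OutlinkFilterSketch`, `collectOutlinks`) and the `.exe`-based filter rule are hypothetical stand-ins, and in real Nutch the filtering goes through `URLFilters.filter()` and the logging through the class's `LOG` instance.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkFilterSketch {

    // Stand-in for Nutch's URLFilters.filter(): returns null when a URL
    // is rejected. Here we assume executables are filtered out, matching
    // the reporter's use case of finding but not downloading them.
    static String filter(String url) {
        return url.endsWith(".exe") ? null : url;
    }

    // Returns the outlinks that survive filtering; records a log line for
    // each URL the filters skipped, as the patch does at INFO level.
    static List<String> collectOutlinks(String fromUrl, List<String> outlinks,
                                        List<String> skippedLog) {
        List<String> kept = new ArrayList<>();
        for (String toUrl : outlinks) {
            if (filter(toUrl) == null) {
                // The line the patch adds: log the skip instead of
                // dropping the URL silently.
                skippedLog.add("filtering content " + toUrl
                        + " linked to by " + fromUrl);
                continue;
            }
            kept.add(toUrl);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<String> kept = collectOutlinks("http://example.com/page.html",
                List.of("http://example.com/a.html",
                        "http://example.com/setup.exe"),
                skipped);
        System.out.println(kept);
        System.out.println(skipped);
    }
}
```

Collecting the skipped URLs into a list rather than calling a logger directly keeps the sketch testable; in ParseOutputFormat itself the equivalent call is simply `LOG.info(...)`.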