Hi Robert,

404s are only recorded in the CrawlDb after the "updatedb" tool has been run on the fetched segment. Could you share the commands you're running? Please also have a look into the log files (esp. hadoop.log): every fetch is logged there, including whether it failed. If you cannot find a log message for the broken links, the URLs may have been filtered out. In that case, please also share your configuration (if it differs from the default).
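For reference, a minimal cycle which would bring a 404 into the CrawlDb looks roughly like this - the directory names are just placeholders, and <segment> stands for the timestamped segment directory created by the generate step:

  # inject seeds, generate a fetch list, fetch and parse it
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch fetch crawl/segments/<segment>
  bin/nutch parse crawl/segments/<segment>
  # only this step writes the fetch statuses (incl. 404s) back to the CrawlDb
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
  # the 404s should now show up counted under db_gone
  bin/nutch readdb crawl/crawldb -stats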
Best,
Sebastian

On 3/2/20 11:11 PM, Robert Scavilla wrote:
> Nutch 1.14:
> I am looking at the FetcherThread code. The 404 URL does get flagged with
> a ProtocolStatus.NOTFOUND, but the broken link never gets into the crawldb.
> It does, however, get into the linkdb. Please tell me how I can collect
> these 404 URLs.
>
> Any help would be appreciated,
> ...bob
>
>   case ProtocolStatus.NOTFOUND:
>   case ProtocolStatus.GONE: // gone
>   case ProtocolStatus.ACCESS_DENIED:
>   case ProtocolStatus.ROBOTS_DENIED:
>     output(fit.url, fit.datum, null, status,
>         CrawlDatum.STATUS_FETCH_GONE); // broken link is getting here
>     break;
>
> On Fri, Feb 28, 2020 at 12:06 PM Robert Scavilla <rscavi...@gmail.com>
> wrote:
>
>> Hi again, and thank you in advance for your kind help.
>>
>> I'm using Nutch 1.14.
>>
>> I'm trying to use Nutch to find broken links (404s) on a site. I
>> followed the instructions:
>>
>>   bin/nutch readdb <crawlFolder>/crawldb/ -dump myDump
>>
>> but the dump only shows 200 and 301 statuses. There is no sign of any
>> broken link. When I enter just one broken link in the seed file, the
>> crawldb is empty.
>>
>> Please advise how I can inspect broken links with Nutch 1.14.
>>
>> Thank you!
>> ...bob
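To answer the readdb question above: once the 404s have reached the CrawlDb via updatedb, you can restrict the dump to them instead of scanning the full dump. A sketch, assuming the -status filter of readdb is available in your 1.14 build and reusing the placeholder paths from above:

  # dump only records with status db_gone (404, gone, access/robots denied)
  bin/nutch readdb crawl/crawldb -dump gone_dump -status db_gone

  # or inspect the CrawlDatum of a single URL directly
  bin/nutch readdb crawl/crawldb -url http://example.com/broken-page

db_gone is the CrawlDb status that STATUS_FETCH_GONE (the case in the FetcherThread snippet you quoted) is converted to by updatedb.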