My goal is to use Nutch "normally" to crawl, parse, extract links, and index textual content, but with the added goal of fetching and saving *all* resources found at outlinks. My understanding is that there is no straightforward way to collect resources like this, i.e. no dedicated extension point. I found a few posts where users asked how to save the original content of crawled resources; I'll address those options here:
1. Modify the fetcher code to "save" fetched resources (http://stackoverflow.com/a/10060160/1689220). This is not a modular approach.
2. Write an HtmlParseFilter that adds the original byte content to the parse data, plus an IndexingFilter that adds that same content to the document (http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't think this makes sense for non-HTML resources.

Another approach would be to implement a "Parser" that isn't really a parser at all: it would store the original resource content however I see fit and then return `null`, which (as I understand it) causes Nutch to try the next configured Parser (e.g. parse-tika). This might work, and in this pseudo-parser I could even prevent things like images from being passed to the real parser (see the sketch at the end of this message).

If we assume that textual resources contain outlinks and non-textual resources do not, then ideally Nutch would fetch *all* links, pass them to my code for storage, and pass only textual resources on to parsing and indexing. What would be the best way to do this?

Thanks,
Joe
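
P.S. For concreteness, here is a rough sketch of what I mean by the pseudo-parser, assuming the Nutch 1.x `Parser` plugin interface. The class name, the dump directory, and the file-naming scheme are just placeholders, and I haven't verified that returning `null` really falls through to the next parser:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

/**
 * Pseudo-parser sketch: dumps the raw fetched bytes to disk, then returns
 * null so that (hopefully) Nutch falls through to the next Parser configured
 * for the content type (e.g. parse-tika).
 */
public class RawContentDumper implements Parser {

  // Placeholder dump location; in practice this would come from configuration.
  private static final Path DUMP_DIR = Paths.get("/data/raw-dump");

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    try {
      Files.createDirectories(DUMP_DIR);
      // Flat file name derived from the URL hash -- illustrative only.
      String name = Integer.toHexString(content.getUrl().hashCode());
      Files.write(DUMP_DIR.resolve(name), content.getContent());
    } catch (Exception e) {
      // Saving the raw bytes is best-effort; don't fail the whole parse job.
    }
    // For non-textual types (content.getContentType() like "image/*") I could
    // presumably return something here that stops further parsing instead.
    return null; // let the next configured parser handle the content
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

Presumably I'd add this plugin to plugin.includes and map it in parse-plugins.xml ahead of parse-tika for the relevant MIME types, but I'm not sure that's the right way to order parsers.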