My goal is to use Nutch "normally" to crawl, parse, extract links, and index textual content, but with the added goal of fetching and saving *all* resources found at outlinks. My understanding is that there is no straightforward way to collect resources like this, i.e. no dedicated extension point. I found a few posts where users asked how to save the original content of crawled resources; I'll address those options here:
1. Modify the fetcher code to "save" fetched resources (http://stackoverflow.com/a/10060160/1689220). This is not a modular approach.
2. Write an HtmlParseFilter that adds the original byte content to the parse data, plus an IndexingFilter that adds that same content to the document (http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't think this makes sense for non-HTML resources.

Another approach would be to implement a "Parser" that isn't really a parser at all: it would store the original resource content however I see fit and then return `null`, which (as I understand it) causes Nutch to try the next configured Parser (e.g. parse-tika). This might work, and in this pseudo-parser I could even prevent things like images from being passed to the real parser (see the sketch at the end of this message).

If we assume that textual resources contain outlinks and non-textual resources do not, then ideally Nutch would fetch *all* links, pass them to my code for storage, and pass only textual resources on to parsing and indexing. What would be the best way to do this?

Thanks,
Joe
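
P.S. For concreteness, here is a rough sketch of what I mean by the pseudo-parser, assuming the Nutch 1.x `Parser` plugin interface. The class name, the dump directory, and the file-naming scheme are just placeholders, and I haven't verified that returning `null` really falls through to the next parser:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.protocol.Content;

/**
 * Pseudo-parser sketch: dumps the raw fetched bytes to disk, then returns
 * null so that (hopefully) Nutch falls through to the next Parser configured
 * for the content type (e.g. parse-tika).
 */
public class RawContentDumper implements Parser {

  // Placeholder dump location; in practice this would come from configuration.
  private static final Path DUMP_DIR = Paths.get("/data/raw-dump");

  private Configuration conf;

  @Override
  public ParseResult getParse(Content content) {
    try {
      Files.createDirectories(DUMP_DIR);
      // Flat file name derived from the URL hash -- illustrative only.
      String name = Integer.toHexString(content.getUrl().hashCode());
      Files.write(DUMP_DIR.resolve(name), content.getContent());
    } catch (Exception e) {
      // Saving the raw bytes is best-effort; don't fail the whole parse job.
    }
    // For non-textual types (content.getContentType() like "image/*") I could
    // presumably return something here that stops further parsing instead.
    return null; // let the next configured parser handle the content
  }

  @Override
  public void setConf(Configuration conf) { this.conf = conf; }

  @Override
  public Configuration getConf() { return conf; }
}
```

Presumably I'd add this plugin to plugin.includes and map it in parse-plugins.xml ahead of parse-tika for the relevant MIME types, but I'm not sure that's the right way to order parsers.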