Hello,

To restrict parsing to XHTML/HTML pages, you can replace the asterisk in the contentType parameter of Tika's plugin.xml with a pipe-separated list of MIME types:

<parameter name="contentType" value="text/html|application/xhtml+xml"/>
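
For context, here is a rough sketch of how the relevant part of the parse-tika plugin.xml might look after the change. The surrounding <extension> and <implementation> elements and their id/class values are written from memory of a typical Nutch 1.x layout and are assumptions, so please check them against the plugin.xml shipped with your version. Only the contentType parameter itself comes from the suggestion above.

```xml
<!-- Sketch only: element ids and class names below are assumed, not copied from a specific release. -->
<extension id="org.apache.nutch.parse.tika"
           name="TikaParser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.tika.TikaParser"
                  class="org.apache.nutch.parse.tika.TikaParser">
    <!-- Previously value="*" (all content types); now limited to HTML and XHTML. -->
    <parameter name="contentType" value="text/html|application/xhtml+xml"/>
  </implementation>
</extension>
```

With this in place, Tika should no longer claim other content types such as images. Depending on your setup, the MIME-type mappings in parse-plugins.xml may also be worth checking.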

-----Original message-----
> From: Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Tuesday 16th February 2016 1:18
> To: user@nutch.apache.org
> Subject: RE: Crawling while collecting resources
>
> I discovered that the Protocol extension point is a good place to do this,
> since it is responsible for actually fetching content.
>
> Is it possible with Nutch to fetch content that I may not want to
> parse/index?
>
> Example: I want to fetch images in addition to HTML, but I only want the
> HTML to be parsed and indexed.
>
> Thanks,
> Joe
>
> -----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com]
> Sent: Monday, February 08, 2016 7:29 PM
> To: 'user@nutch.apache.org' <user@nutch.apache.org>
> Subject: Crawling while collecting resources
>
> My goal is to use Nutch "normally" to crawl, parse, extract links and index
> said textual content but with the added goal of fetching and saving *all*
> resources found at outlinks. It is my understanding that there is no
> straightforward method for collecting resources this way, i.e. an extension
> point. I found a few posts where users asked how to save the original
> content of crawled resources. I'll address those options here:
>
> 1. Modify the fetcher code to "save" fetched resources
> (http://stackoverflow.com/a/10060160/1689220). This is not a modular
> approach.
> 2. Write an HtmlParseFilter that adds the original byte content to the parse
> data, and an IndexingFilter that just adds the same content to the document
> (http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't
> think this makes sense for non-HTML resources.
>
> Another approach would be to implement a "Parser" that isn't a parser at
> all, but just stores the original resource content however I see fit, then
> returns `null`, which causes Nutch to try the next configured Parser (e.g.
> parse-tika). This might work, and I could even prevent things like images
> from being passed to the real parser in my pseudo-parser.
>
> If we just assume that textual resources contain outlinks and non-textual
> resources do not, ideally Nutch would fetch *all* links, pass them to my
> code for storing, and only pass textual resources on to parse and index.
> What would be the best way to do this?
>
> Thanks,
> Joe