RE: Crawling while collecting resources

Markus Jelsma Tue, 16 Feb 2016 03:45:43 -0800

Hello, to restrict parsing to HTML/XHTML pages, you can replace the asterisk in
the contentType parameter of the Tika parser plugin's plugin.xml with a
pipe-separated list of the MIME types you want parsed:
 <parameter name="contentType" value="text/html|application/xhtml+xml"/>
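
For illustration only, here is a rough, self-contained sketch of how such a value can be matched against a fetched document's MIME type. It assumes the value is used as a regular expression after escaping "+" and ".", with "*" meaning "any type"; that is my reading of the behaviour, not something confirmed here, so check it against your Nutch version. The class and method names are hypothetical and not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Nutch: shows how a contentType value such
// as "text/html|application/xhtml+xml" could be matched against a MIME type.
final class ContentTypeMatchSketch {

    // Assumption: "*" accepts every type; any other value is a pipe-separated
    // list used as a regex once the literal "+" and "." are escaped.
    static boolean accepts(String contentTypeValue, String mimeType) {
        if ("*".equals(contentTypeValue)) {
            return true;
        }
        String regex = contentTypeValue.replace("+", "\\+").replace(".", "\\.");
        // Pattern.matches requires the whole MIME type to match one alternative.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String value = "text/html|application/xhtml+xml";
        String[] types = {"text/html", "application/xhtml+xml", "image/png"};
        for (String type : types) {
            System.out.println(type + " -> " + accepts(value, type));
        }
    }
}
```

Under that assumption, images would still be fetched but would not be handed to the Tika parser, because "image/png" matches neither alternative.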


-----Original message-----
> From:Joseph Naegele <jnaeg...@grierforensics.com>
> Sent: Tuesday 16th February 2016 1:18
> To: user@nutch.apache.org
> Subject: RE: Crawling while collecting resources
> 
> I discovered that the Protocol extension point is a good place to do this,
> since it is responsible for actually fetching content.
> 
> Is it possible with Nutch to fetch content that I may not want to
> parse/index?
> 
> Example: I want to fetch images in addition to HTML, but I only want the
> HTML to be parsed and indexed.
> 
> Thanks,
> Joe
> 
> -----Original Message-----
> From: Joseph Naegele [mailto:jnaeg...@grierforensics.com] 
> Sent: Monday, February 08, 2016 7:29 PM
> To: 'user@nutch.apache.org' <user@nutch.apache.org>
> Subject: Crawling while collecting resources
> 
> My goal is to use Nutch "normally" to crawl, parse, extract links and index
> said textual content but with the added goal of fetching and saving *all*
> resources found at outlinks. It is my understanding that there is no
> straightforward method for collecting resources this way, i.e. an extension
> point. I found a few posts where users asked how to save the original
> content of crawled resources. I'll address those options here:
> 
> 1. Modify the fetcher code to "save" fetched resources
> (http://stackoverflow.com/a/10060160/1689220). This is not a modular
> approach.
> 2. Write an HtmlParseFilter that adds the original byte content to the parse
> data, and an IndexingFilter that just adds the same content to the document
> (http://www.mail-archive.com/user%40nutch.apache.org/msg03659.html). I don't
> think this makes sense for non-HTML resources.
> 
> Another approach would be to implement a "Parser" that isn't a parser at
> all, but just stores the original resource content however I see fit, then
> returns `null`, which causes Nutch to try the next configured Parser (e.g.
> parse-tika). This might work, and I could even prevent things like images
> from being passed to the real parser in my pseudo-parser.
> 
> If we just assume that textual resources contain outlinks and non-textual
> resources do not, ideally Nutch would fetch *all* links, pass them to my
> code for storing, and only pass textual resources on to parse and index.
> What would be the best way to do this?
> 
> Thanks,
> Joe
> 
>
