The 1.x indexer can both filter and normalize URLs at indexing time.
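If I remember correctly, the solrindex command takes -filter and -normalize
switches for exactly that (check the usage message of your version; the paths
below are just examples):

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/20121216000000 -filter -normalize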
 
-----Original message-----
> From:Julien Nioche <lists.digitalpeb...@gmail.com>
> Sent: Mon 17-Dec-2012 15:11
> To: user@nutch.apache.org
> Subject: Re: How to extend Nutch for article crawling
> 
> Hi
> 
> See comments below
> 
> 
> > 1. Add article list pages into url/seed.txt
> >     Here's one problem. What I actually want to be indexed is the article
> > pages, not the article list pages. But if I don't allow the list pages to
> > be indexed, Nutch will do nothing, because the list pages are the entrance.
> > So, how can I index only the article pages, without the list pages?
> >
> 
> I think that the indexer can now filter URLs but can't remember whether it
> is for 1.x only or is in 2.x as well. Anyone?
> This would work if you can find a regular expression that captures the list
> pages. Another approach would be to tweak the indexer so that it skips
> documents containing an arbitrary metadatum (e.g. skip.indexing); this
> metadatum would be set by a custom parser when processing the list pages.
> 
> I think this would be a useful feature to have anyway. URL filters use the
> URL string only, and having the option to skip based on metadata would be
> good IMHO.
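> 
> For the second approach, an indexing filter can veto a document by returning
> null. Here is a minimal, untested sketch against the 2.x IndexingFilter
> interface; the skip.indexing key is just an arbitrary name that your custom
> parser would have to set:
> 
> import java.nio.ByteBuffer;
> import java.util.Collection;
> import java.util.Collections;
> import org.apache.avro.util.Utf8;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.indexer.NutchDocument;
> import org.apache.nutch.storage.WebPage;
> 
> public class SkipIndexingFilter implements IndexingFilter {
> 
>   private static final Utf8 SKIP_KEY = new Utf8("skip.indexing");
>   private Configuration conf;
> 
>   @Override
>   public NutchDocument filter(NutchDocument doc, String url, WebPage page)
>       throws IndexingException {
>     // Returning null drops the document from the index.
>     ByteBuffer flag = page.getFromMetadata(SKIP_KEY);
>     return flag == null ? doc : null;
>   }
> 
>   @Override
>   public Collection<WebPage.Field> getFields() {
>     // We only need the metadata field loaded from the backend.
>     return Collections.singleton(WebPage.Field.METADATA);
>   }
> 
>   @Override
>   public void setConf(Configuration conf) { this.conf = conf; }
> 
>   @Override
>   public Configuration getConf() { return conf; }
> }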
> 
> 
> >
> > 2. Write a plugin to parse out the 'author', 'date', 'article body',
> > 'headline' and maybe other information from the HTML.
> >     The 'Parser' plugin interface in Nutch 2.1 is:
> >     Parse getParse(String url, WebPage page)
> >     And the 'WebPage' class has some predefined attributes:
> > public class WebPage extends PersistentBase {
> >   //...
> >   private Utf8 baseUrl;
> >   // ...
> >   private Utf8 title;
> >   private Utf8 text;
> >   // ...
> >   private Map<Utf8,ByteBuffer> metadata;
> >   // ...
> > }
> >
> >     So, the only field I can put my specified attributes in is the
> > 'metadata'. Is it designed for this purpose?
> >     BTW, the Parser in trunk looks like: 'public ParseResult
> > getParse(Content content)', and seems more reasonable to me.
> >
> 
> The extension point Parser is for low-level parsing, i.e. extracting text
> and metadata from binary formats, which is typically done by parse-tika.
> What you want to implement is an extension of ParseFilter that adds your
> own entries to the parse metadata. The creative commons plugin should be a
> good example to get started with.
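> 
> To give you an idea, here is an untested sketch of such a ParseFilter for
> 2.x. The DOM-walking helper and the article.author key are made up for the
> example; the real extraction logic would be specific to the sites you crawl:
> 
> import java.nio.ByteBuffer;
> import java.nio.charset.StandardCharsets;
> import java.util.Collection;
> import java.util.Collections;
> import org.apache.avro.util.Utf8;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.nutch.parse.HTMLMetaTags;
> import org.apache.nutch.parse.Parse;
> import org.apache.nutch.parse.ParseFilter;
> import org.apache.nutch.storage.WebPage;
> import org.w3c.dom.DocumentFragment;
> import org.w3c.dom.NamedNodeMap;
> import org.w3c.dom.Node;
> import org.w3c.dom.NodeList;
> 
> public class ArticleParseFilter implements ParseFilter {
> 
>   private Configuration conf;
> 
>   @Override
>   public Parse filter(String url, WebPage page, Parse parse,
>       HTMLMetaTags metaTags, DocumentFragment doc) {
>     // Illustration only: pick up <meta name="author" content="..."> and
>     // stash it in the page metadata for an indexing filter to use later.
>     String author = findMetaContent(doc, "author");
>     if (author != null) {
>       page.putToMetadata(new Utf8("article.author"),
>           ByteBuffer.wrap(author.getBytes(StandardCharsets.UTF_8)));
>     }
>     return parse;
>   }
> 
>   // Naive recursive search for a <meta> tag with the given name attribute.
>   private String findMetaContent(Node node, String name) {
>     if ("meta".equalsIgnoreCase(node.getNodeName())) {
>       NamedNodeMap attrs = node.getAttributes();
>       Node n = attrs.getNamedItem("name");
>       if (n != null && name.equalsIgnoreCase(n.getNodeValue())) {
>         Node content = attrs.getNamedItem("content");
>         return content == null ? null : content.getNodeValue();
>       }
>     }
>     NodeList children = node.getChildNodes();
>     for (int i = 0; i < children.getLength(); i++) {
>       String found = findMetaContent(children.item(i), name);
>       if (found != null) return found;
>     }
>     return null;
>   }
> 
>   @Override
>   public Collection<WebPage.Field> getFields() {
>     return Collections.singleton(WebPage.Field.METADATA);
>   }
> 
>   @Override
>   public void setConf(Configuration conf) { this.conf = conf; }
> 
>   @Override
>   public Configuration getConf() { return conf; }
> }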
> 
> 
> >
> > 3. After the articles are indexed into Solr, another application can query
> > it by 'date' and then store the article information into MySQL.
> >     My question here is: can Nutch store the articles directly into MySQL?
> > Or can I write a plugin to specify the indexing behavior?
> >
> 
> You could use the MySQL backend in GORA (but it is broken AFAIK) and have
> the other application read from it; alternatively you could write a custom
> indexer that sends documents directly into MySQL, but that would be a bit
> redundant. Do you need to use SOLR at all, or is the aim simply to store in
> MySQL?
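> 
> If you do keep SOLR, the bridging application can stay very small. Here is
> an untested sketch using SolrJ and plain JDBC; the field names (date, url,
> headline, author, content) and the table schema are placeholders for
> whatever your indexing setup actually produces:
> 
> import java.sql.Connection;
> import java.sql.DriverManager;
> import java.sql.PreparedStatement;
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> import org.apache.solr.client.solrj.response.QueryResponse;
> import org.apache.solr.common.SolrDocument;
> 
> public class ArticleExporter {
> 
>   public static void main(String[] args) throws Exception {
>     // Older MySQL drivers need to be loaded explicitly.
>     Class.forName("com.mysql.jdbc.Driver");
>     HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
>     Connection db = DriverManager.getConnection(
>         "jdbc:mysql://localhost:3306/articles", "user", "password");
> 
>     // Pull one day's worth of articles out of SOLR.
>     SolrQuery query = new SolrQuery(
>         "date:[2012-12-16T00:00:00Z TO 2012-12-17T00:00:00Z]");
>     query.setRows(1000);
>     QueryResponse rsp = solr.query(query);
> 
>     PreparedStatement insert = db.prepareStatement(
>         "INSERT INTO article (url, headline, author, body) VALUES (?, ?, ?, ?)");
>     for (SolrDocument d : rsp.getResults()) {
>       insert.setString(1, (String) d.getFieldValue("url"));
>       insert.setString(2, (String) d.getFieldValue("headline"));
>       insert.setString(3, (String) d.getFieldValue("author"));
>       insert.setString(4, (String) d.getFieldValue("content"));
>       insert.executeUpdate();
>     }
>     db.close();
>   }
> }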
> 
> 
> >
> > Is Nutch a good choice for my purpose? If not, could you guys suggest
> > another good-quality framework/library for me?
> >
> 
> You can definitely do that with Nutch. There are certainly other resources
> that could be used, but they might also need a bit of customisation anyway.
> 
> HTH
> 
> Julien
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
> 
