Hi

>>> The callback method in the IndexingFilter has a 'URL' parameter and
>>> returns NutchDocument, so it is hard to be customized to do this.

I did not mean the IndexingFilter, but using a standard URLFilter during
the indexing step.

>>> So, it's better to add a 'skip' ability to the IndexingFilter based on
>>> URL or metadata.
>>>
>>> @Override
>>> public NutchDocument filter(NutchDocument doc, String url, WebPage page)

Well, we could indeed nullify a document in an IndexingFilter based on the
metadata found in the WebPage, which would save us hacking the code of the
indexer. That's assuming that the rest of the chain behaves nicely in the
presence of a null.

[...]

>>> Good suggestion. It is not decided yet whether to depend on SOLR or not.
>>> SOLR is an amazing tool for indexing, however I'm not quite sure whether
>>> it is good to store the 'content' inside it. By default, the 'content'
>>> field is configured only to be indexed but not stored. What do you think?

Many people see SOLR as a distributed NoSQL datastore which also has the
additional benefit of indexing + search. Whether it is a good match for your
needs depends on the sort of queries you'd want to run on it. It can
definitely store the content, and I believe that the recent versions have
great performance for compression, so this is not really an issue.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
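P.S. The "nullify a document based on WebPage metadata" idea above could be sketched roughly as below. This is only an illustration, not Nutch code: `NutchDocument` and `WebPage` are minimal stand-ins for the real `org.apache.nutch` classes, and the metadata key name `index.skip` is made up. It also assumes, as noted above, that the rest of the indexing chain tolerates a null return.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the real Nutch classes, just so the
// sketch compiles on its own; the actual types live in org.apache.nutch.
class NutchDocument {
    // fields omitted in this sketch
}

class WebPage {
    private final Map<String, String> metadata = new HashMap<>();
    Map<String, String> getMetadata() { return metadata; }
}

// Sketch of an IndexingFilter-style plugin that returns null for pages
// carrying a "skip" marker in their metadata, so the chain drops them.
public class SkipByMetadataFilter {

    // Hypothetical marker key; a real deployment would choose its own.
    static final String SKIP_KEY = "index.skip";

    static boolean shouldSkip(Map<String, String> metadata) {
        return "true".equals(metadata.get(SKIP_KEY));
    }

    // Mirrors the filter(NutchDocument, String, WebPage) signature quoted above.
    public NutchDocument filter(NutchDocument doc, String url, WebPage page) {
        if (shouldSkip(page.getMetadata())) {
            // Returning null signals "do not index this document";
            // this assumes downstream filters check for null.
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        WebPage marked = new WebPage();
        marked.getMetadata().put(SKIP_KEY, "true");
        WebPage normal = new WebPage();

        SkipByMetadataFilter f = new SkipByMetadataFilter();
        NutchDocument doc = new NutchDocument();

        System.out.println(f.filter(doc, "http://example.com/a", marked) == null);
        System.out.println(f.filter(doc, "http://example.com/b", normal) == doc);
    }
}
```

The marker itself would be set earlier in the pipeline (e.g. by a parse filter or scoring plugin writing into the page metadata), which keeps the skip decision out of the indexer code entirely.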