Hi

>>> The callback method in the IndexingFilter has a 'URL' parameter and
>>> returns NutchDocument, so it is hard to be customized to do this.

I did not mean the IndexingFilter, but using a standard URLFilter during
the indexing step.

>>> So, it's better to add a 'skip' ability to the IndexingFilter based on
>>> URL or metadata.
>>>
>>> @Override
>>> public NutchDocument filter(NutchDocument doc, String url, WebPage page)

Well, we could indeed nullify a document in an IndexingFilter based on the
metadata found in the WebPage, which would save us hacking the code of the
indexer. That's assuming that the rest of the chain behaves nicely in the
presence of a null.

[...]

>>> Good suggestion. It is not decided yet whether to depend on SOLR or not.
>>> SOLR is an amazing tool for indexing, however I'm not quite sure whether
>>> it is good to store the 'content' inside it. By default, the 'content'
>>> field is configured only to be indexed but not stored. What do you think?

Many people see SOLR as a distributed NoSQL datastore which also has the
additional benefit of indexing + search. Whether it is a good match for your
needs depends on the sort of queries you'd want to run on it. It can
definitely store the content, and I believe that the recent versions have
great performance for compression, so this is not really an issue.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
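P.S. The "nullify a document based on WebPage metadata" idea above could be sketched roughly as below. This is only an illustration, not Nutch code: `NutchDocument` and `WebPage` are minimal stand-ins for the real `org.apache.nutch` classes, and the metadata key name `index.skip` is made up. It also assumes, as noted above, that the rest of the indexing chain tolerates a null return.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the real Nutch classes, just so the
// sketch compiles on its own; the actual types live in org.apache.nutch.
class NutchDocument {
    // fields omitted in this sketch
}

class WebPage {
    private final Map<String, String> metadata = new HashMap<>();
    Map<String, String> getMetadata() { return metadata; }
}

// Sketch of an IndexingFilter-style plugin that returns null for pages
// carrying a "skip" marker in their metadata, so the chain drops them.
public class SkipByMetadataFilter {

    // Hypothetical marker key; a real deployment would choose its own.
    static final String SKIP_KEY = "index.skip";

    static boolean shouldSkip(Map<String, String> metadata) {
        return "true".equals(metadata.get(SKIP_KEY));
    }

    // Mirrors the filter(NutchDocument, String, WebPage) signature quoted above.
    public NutchDocument filter(NutchDocument doc, String url, WebPage page) {
        if (shouldSkip(page.getMetadata())) {
            // Returning null signals "do not index this document";
            // this assumes downstream filters check for null.
            return null;
        }
        return doc;
    }

    public static void main(String[] args) {
        WebPage marked = new WebPage();
        marked.getMetadata().put(SKIP_KEY, "true");
        WebPage normal = new WebPage();

        SkipByMetadataFilter f = new SkipByMetadataFilter();
        NutchDocument doc = new NutchDocument();

        System.out.println(f.filter(doc, "http://example.com/a", marked) == null);
        System.out.println(f.filter(doc, "http://example.com/b", normal) == doc);
    }
}
```

The marker itself would be set earlier in the pipeline (e.g. by a parse filter or scoring plugin writing into the page metadata), which keeps the skip decision out of the indexer code entirely.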