Eric Osgood wrote:
Is there a way to inspect the list of links that nutch finds per page and then at that point choose which links I want to include / exclude? that is the ideal remedy to my problem.
Yes, look at ParseOutputFormat, you can make this decision there. There are two standard etension points where you can hook up - URLFilters and ScoringFilters.
Please note that if you use URLFilters to filter out URL-s too early then they will be rediscovered again and again. A better method to handle this, but also more complicated, is to still include such links but give them a special flag (in metadata) that prevents fetching. This requires that you implement a custom scoring plugin.
-- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com