Hi, I am trying to write a Nutch plugin. I was wondering whether it is possible to mark URLs based on the content of a fetched page. The idea is to prevent a given URL from being refetched in the future, based on an analysis of its text content.
What I have tried so far is extending ScoringFilter, keeping the URLs in a HashSet defined in my ScoringFilter, and then updating the CrawlDatum in updateDbScore, but it seems that the HashSet does not persist across the parsing and scoring steps. As the documentation is rather sparse, I would like to ask the community what I can do about this problem. Kind regards

