Hi,
I am trying to make a Nutch plugin.
I was wondering whether it is possible to mark URLs based on the content
of a fetched page.
The idea is to prevent a given URL from being refetched in the future,
based on analysis of its text content.

What I have tried so far is implementing ScoringFilter, keeping URLs in a
HashSet defined in my ScoringFilter, and then updating the CrawlDatum in
updateDbScore. However, it seems that the HashSet does not persist between
the parsing and scoring steps.
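
To make the symptom concrete, here is a minimal sketch of the pattern
described above. It does not use the real Nutch API: Datum, MarkingFilter,
afterParse and the "unwanted" keyword are hypothetical stand-ins, and only
the name updateDbScore is borrowed from ScoringFilter. It also assumes that
the parse step and the update step may each get their own filter instance,
which I have not confirmed is what Nutch actually does.

```java
import java.util.HashSet;
import java.util.Set;

public class MarkingFilterSketch {

    // Hypothetical stand-in for CrawlDatum; only a single flag is modelled.
    static class Datum {
        boolean doNotRefetch = false;
    }

    // Hypothetical stand-in for a scoring filter that remembers URLs in memory.
    static class MarkingFilter {
        private final Set<String> marked = new HashSet<>();

        // Parse step: remember the URL if its text matches the (assumed) rule.
        void afterParse(String url, String text) {
            if (text.contains("unwanted")) {
                marked.add(url);
            }
        }

        // Update step: flag the datum only if this instance remembered the URL.
        void updateDbScore(String url, Datum datum) {
            if (marked.contains(url)) {
                datum.doNotRefetch = true;
            }
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/page";

        // One instance handles both steps: the mark is visible.
        MarkingFilter same = new MarkingFilter();
        same.afterParse(url, "some unwanted text");
        Datum a = new Datum();
        same.updateDbScore(url, a);
        System.out.println("same instance: " + a.doNotRefetch);      // true

        // A fresh instance handles the update step: its set is empty.
        MarkingFilter parseSide = new MarkingFilter();
        parseSide.afterParse(url, "some unwanted text");
        MarkingFilter updateSide = new MarkingFilter();
        Datum b = new Datum();
        updateSide.updateDbScore(url, b);
        System.out.println("separate instances: " + b.doNotRefetch); // false
    }
}
```

The second case matches what I observe: the mark set during parsing is
gone by the time updateDbScore runs.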

As the documentation is rather sparse, I would like to ask the community
what I can do about this problem.

Kind regards
