Sebastian, thank you very much for your help. I have arrived at exactly the same solution by following the Nutch code, and it works like a charm. However, I have one more concern. I want to read, in my IndexFilter, the CrawlDatum metadata that I set in the distributeScoreToOutlinks method. In the IndexFilter I do:
    Text shouldIndex = (Text) crawlDatum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));

SHOULD_REFETCH_AND_INDEX is the metadata tag that I put in place in the distributeScoreToOutlinks method. However, this sometimes throws a NullPointerException, which seems odd to me, because I have double-checked by dumping the CrawlDb and these URLs do have this tag in their metadata. Any hints on that?

Again, thank you Sebastian for your response.

Best regards
Maciek

On Thu, 12 Dec 2024 at 10:42, Sebastian Nagel <[email protected]> wrote:

> Hi Maciek,
>
> > The concept behind it is to prevent given URL from refetching in the
> > future based on text content analysis.
>
> > extending ScoringFilter
>
> Yes, it's the right plugin type to implement such a feature.
>
> > keeping urls in a HashSet defined in my ScoringFilter and then updating
> > CrawlDatum in updateDbScore, but it seems that the HashSet is not
> > persistent throughout parsing and scoring process.
>
> Indeed. Everything which should be persistent needs to be stored in Nutch
> data structures. Assuming the "text content analysis" is done during
> parsing, the flag or score needs to be passed forward via
> - passScoreAfterParsing
> - distributeScoreToOutlinks
>   (in addition to passing stuff to outlinks, you can "adjust" the
>   CrawlDatum of the page being processed)
> - updateDbScore
>   - here you would modify the next fetch time of the page, and possibly
>     also the retry interval
>   - if necessary you can store additional information in the CrawlDatum's
>     metadata
>
> > As the documentation is very modest,
>
> I agree. The wiki page [1] needs an overhaul for sure.
>
> Best,
> Sebastian
>
> [1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring
>
> On 12/10/24 12:15, Maciek Puzianowski wrote:
> > Hi,
> > I am trying to make a Nutch plugin.
> > I was wondering if it is possible to mark URLs based on the content of a
> > fetched page.
> > The concept behind it is to prevent a given URL from being refetched in
> > the future, based on text content analysis.
> >
> > What I have tried so far is extending ScoringFilter and keeping URLs in
> > a HashSet defined in my ScoringFilter, then updating the CrawlDatum in
> > updateDbScore, but it seems that the HashSet is not persistent
> > throughout the parsing and scoring process.
> >
> > As the documentation is very modest, I would like to ask the community
> > what I can do about this problem.
> >
> > Kind regards
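A note on the NullPointerException: the lookup itself can legitimately return null for some records, since not every CrawlDatum that reaches an indexing filter necessarily carries the tag, even when a CrawlDb dump shows it for the URL. A defensive read avoids the crash. The sketch below shows the pattern; it is self-contained, so it uses a plain java.util.Map of Strings as a stand-in for the real metadata map (in Nutch 1.x, CrawlDatum.getMetaData() returns a Hadoop MapWritable keyed by Text), and the key name is an assumed spelling of the SHOULD_REFETCH_AND_INDEX constant from the message above.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a null-safe metadata lookup.  Real Nutch code would call
// crawlDatum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX)); the
// defensive pattern is identical: the key may simply be absent.
public class MetadataReadSketch {

    // Assumed spelling of the tag; the actual constant lives in the plugin.
    static final String SHOULD_REFETCH_AND_INDEX = "shouldRefetchAndIndex";

    // Return the flag value, or a default when the tag is absent, instead
    // of dereferencing a possibly-null result.
    static String readFlag(Map<String, String> metaData, String defaultValue) {
        String value = metaData.get(SHOULD_REFETCH_AND_INDEX);
        return (value != null) ? value : defaultValue;
    }

    public static void main(String[] args) {
        Map<String, String> tagged = new HashMap<>();
        tagged.put(SHOULD_REFETCH_AND_INDEX, "true");
        Map<String, String> untagged = new HashMap<>(); // tag never written

        System.out.println(readFlag(tagged, "false"));   // prints "true"
        System.out.println(readFlag(untagged, "false")); // prints "false"
    }
}
```

The same guard, applied before the cast in the IndexFilter, turns the intermittent NullPointerException into a well-defined default.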

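For completeness, Sebastian's updateDbScore suggestion, modifying the next fetch time once the flag has arrived in the CrawlDatum, amounts to a small policy decision. The sketch below shows only that policy logic, not Nutch API code: the ten-year horizon is an arbitrary example value, and in a real ScoringFilter the result would be applied to the CrawlDatum via its fetch-time setter inside updateDbScore.

```java
import java.util.concurrent.TimeUnit;

// Sketch of a "do not refetch again" policy that updateDbScore could apply.
// The ten-year horizon is a made-up example, not a Nutch default.
public class NextFetchSketch {

    static final long FAR_FUTURE_MS = TimeUnit.DAYS.toMillis(365L * 10);

    // Flagged pages are pushed far into the future; everything else keeps
    // its normal refetch interval.
    static long nextFetchTime(long nowMs, boolean suppressRefetch, long normalIntervalMs) {
        return nowMs + (suppressRefetch ? FAR_FUTURE_MS : normalIntervalMs);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long day = TimeUnit.DAYS.toMillis(1);
        System.out.println(nextFetchTime(now, false, day) - now); // one day in ms
        System.out.println(nextFetchTime(now, true, day) - now);  // ten years in ms
    }
}
```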
