Hi Maciek,
> However, sometimes this throws a NullPointerException, which seems odd,
> because I double-checked by dumping the CrawlDb, and these URLs do have
> the tag in their metadata.
Are there other URLs / items in the CrawlDb as well? I'd especially look at
unfetched ones, as these may not have their metadata initialized.
Otherwise it's difficult to tell...
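For what it's worth, a null-safe lookup sidesteps the NPE regardless of its cause. A minimal sketch, assuming the SHOULD_REFETCH_AND_INDEX key from your snippet and the Nutch 1.x API (CrawlDatum.getMetaData() returns a MapWritable, whose get() yields null on a miss):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

// Sketch: null-safe metadata lookup for use inside an indexing filter.
// Returns the flag value, or null if this datum carries no such tag
// (e.g. an unfetched entry whose metadata was never populated).
private Text getFlag(CrawlDatum datum, String key) {
  if (datum == null) {
    return null;
  }
  Writable value = datum.getMetaData().get(new Text(key));
  return (value instanceof Text) ? (Text) value : null;
}
```

A blind cast of a null return only blows up later when the value is used; the instanceof check also guards against a different Writable type having been stored under the same key.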
Best,
Sebastian
On 12/12/24 13:06, Maciek Puzianowski wrote:
Sebastian,
thank you very much for your help. I arrived at exactly the same solution by
following the Nutch code, and it works like a charm. However, I have one
more concern. I wanted to use the CrawlDatum metadata that I updated in the
distributeScoreToOutlinks method in an IndexingFilter. There I would do
Text shouldIndex = (Text) crawlDatum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));
that is the metadata tag that I put in the distributeScoreToOutlinks
method. However, sometimes this throws a NullPointerException, which seems
odd, because I double-checked by dumping the CrawlDb, and these URLs do
have the tag in their metadata.
Any hints on that?
Again, thank you Sebastian for your response.
Best regards
Maciek
On Thu, 12 Dec 2024 at 10:42, Sebastian Nagel
<[email protected]> wrote:
Hi Maciek,
> The concept behind it is to prevent a given URL from refetching in the
> future based on text content analysis.
> extending ScoringFilter
Yes, it's the right plugin type to implement such a feature.
> keeping URLs in a HashSet defined in my ScoringFilter and then updating
> CrawlDatum in updateDbScore, but it seems that the HashSet is not
> persistent throughout the parsing and scoring process.
Indeed. Everything that should be persistent needs to be stored in Nutch
data structures. Assuming the "text content analysis" is done during
parsing, the flag or score needs to be passed forward via
- passScoreAfterParsing
- distributeScoreToOutlinks
  (in addition to passing information to the outlinks, you can also
  "adjust" the CrawlDatum of the page being processed)
- updateDbScore
  (here you would modify the next fetch time of the page, possibly also
  the retry interval)
- if necessary, you can store additional information in the CrawlDatum's
  metadata
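The steps above can be sketched roughly as follows. This is only a sketch, not a tested implementation: the class name, the SHOULD_REFETCH_AND_INDEX key, and the one-year interval are my assumptions, plugin.xml wiring is omitted, and the exact method signatures should be checked against your Nutch version.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map.Entry;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.AbstractScoringFilter;

public class NoRefetchScoringFilter extends AbstractScoringFilter {

  // Assumed key name, not a Nutch default.
  private static final String SHOULD_REFETCH_AND_INDEX = "shouldRefetchAndIndex";

  // Carry the analysis result from the fetched content into the parse,
  // so it is still available when the outlinks are processed.
  @Override
  public void passScoreAfterParsing(Text url, Content content, Parse parse) {
    String flag = content.getMetadata().get(SHOULD_REFETCH_AND_INDEX);
    if (flag != null) {
      parse.getData().getContentMeta().set(SHOULD_REFETCH_AND_INDEX, flag);
    }
  }

  // Persist the flag on the page's own CrawlDatum via the "adjust" datum.
  @Override
  public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
      ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
      CrawlDatum adjust, int allCount) {
    String flag = parseData.getContentMeta().get(SHOULD_REFETCH_AND_INDEX);
    if (flag != null) {
      if (adjust == null) {
        adjust = new CrawlDatum(CrawlDatum.STATUS_LINKED, 0);
      }
      adjust.getMetaData().put(new Text(SHOULD_REFETCH_AND_INDEX),
          new Text(flag));
    }
    return adjust;
  }

  // Push the next fetch time far into the future for flagged pages.
  @Override
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) {
    if (datum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX)) != null) {
      datum.setFetchInterval(365 * 24 * 3600); // e.g. one year
      datum.setFetchTime(datum.getFetchTime()
          + datum.getFetchInterval() * 1000L);
    }
  }
}
```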
> As the documentation is very modest,
I agree. The wiki page [1] certainly needs an overhaul.
Best,
Sebastian
[1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring
On 12/10/24 12:15, Maciek Puzianowski wrote:
Hi,
I am trying to make a Nutch plugin.
I was wondering if it is possible to mark URLs based on the content of a
fetched page.
The concept behind it is to prevent a given URL from refetching in the
future based on text content analysis.
What I have tried so far is extending ScoringFilter and keeping URLs in a
HashSet defined in my ScoringFilter, then updating the CrawlDatum in
updateDbScore, but it seems that the HashSet is not persistent throughout
the parsing and scoring process.
As the documentation is very modest, I would like to ask the community
what I can do about this problem.
Kind regards