Hi Maciek,

> Unless I've made some silly mistake, I think it is worth a Jira issue.

Sure, please open an issue. Of course, it would be great if it is
reproducible, ideally without any custom plugins.

Best,
Sebastian


On 12/13/24 09:31, Maciek Puzianowski wrote:
In the first 2-3 crawls there are URLs in the CrawlDb with either db_fetched
or db_unfetched status. Only db_fetched URLs go through the IndexingFilter,
and even though all of these URLs have my custom tag in their metadata
(verified in a dumped CrawlDb), some of them seemingly at random throw an
NPE when trying to read that tag.
However, it is not that bad, because I can pass my tag to the IndexingFilter
via ParseData instead, and then there is no problem.
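
For reference, a minimal sketch of that workaround, assuming the tag was
written into the parse metadata during parsing (the key and the index field
name here are illustrative, not Nutch names):

  // inside IndexingFilter.filter(doc, parse, url, datum, inlinks)
  String flag = parse.getData().getParseMeta().get("shouldRefetchAndIndex");
  if (flag != null) {
    doc.add("refetch_flag", flag); // parse metadata survives into indexing
  }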

Unless I've made some silly mistake, I think it is worth a Jira issue.

Best,
Maciek

On Fri, 13 Dec 2024 at 00:23, Sebastian Nagel <[email protected]>
wrote:

Hi Maciek,

  > However, sometimes this gives me a NullPointerException, which is
  > kind of weird to me, because I have double-checked and dumped the
  > CrawlDb and these URLs have this tag in their metadata.

Are there other URLs / items in the CrawlDb as well? I'd especially look
at unfetched ones, as these may not have the Metadata initialized.
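
For example, a defensive read, sketched (SHOULD_REFETCH_AND_INDEX is the
key from your snippet; the index field name and the surrounding filter
parameters doc and datum are assumed):

  // inside IndexingFilter.filter(...): tolerate a missing entry
  Writable value = datum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));
  if (value != null) {
    doc.add("refetch_flag", value.toString());
  }
  // a missing tag (e.g. an unfetched entry) is skipped instead of an NPE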

Otherwise difficult to tell...

Best,
Sebastian

On 12/12/24 13:06, Maciek Puzianowski wrote:
Sebastian,
thank you very much for your help. I have come to the exact same solution
following the Nutch code and it works like a charm. I have one more
concern, though. I wanted to use, in my IndexingFilter, the CrawlDatum
metadata that I updated in the distributeScoreToOutlinks method. In the
IndexingFilter I would do:

  Text shouldIndex = (Text) crawlDatum.getMetaData()
      .get(new Text(SHOULD_REFETCH_AND_INDEX));

where SHOULD_REFETCH_AND_INDEX is the metadata tag that I set in the
distributeScoreToOutlinks method. However, sometimes this gives me a
NullPointerException, which is kind of weird to me, because I have
double-checked and dumped the CrawlDb and these URLs have this tag in
their metadata.
Any hints on that?

Again, thank you Sebastian for your response.

Best regards
Maciek

On Thu, 12 Dec 2024 at 10:42, Sebastian Nagel
<[email protected]> wrote:

Hi Maciek,

   > The concept behind it is to prevent a given URL from being refetched
   > in the future, based on text content analysis.

   > extending ScoringFilter

Yes, it's the right plugin type to implement such a feature.

   > keeping URLs in a HashSet defined in my ScoringFilter, and then
   > updating the CrawlDatum in updateDbScore, but it seems that the
   > HashSet is not persistent throughout the parsing and scoring process.

Indeed. Everything which should be persistent needs to be stored in Nutch
data structures; each step (parse, updatedb, index) runs as a separate
MapReduce job, so in-memory state such as a HashSet does not survive
between them. Assuming the "text content analysis" is done during the
parsing, the flag or score needs to be passed forward via
    - passScoreAfterParsing
    - distributeScoreToOutlinks
      (in addition to passing stuff to outlinks you can "adjust" the
       CrawlDatum of the page being processed)
    - updateDbScore
      - here you would modify the next fetch time of the page,
        possibly also the retry interval
      - if necessary you can store additional information in the
        CrawlDatum's metadata
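
For illustration, a minimal self-contained sketch of such a filter. It
assumes a recent Nutch 1.x where AbstractScoringFilter provides no-op
defaults; the key name and looksWorthRefetching() are made up for this
example, and where exactly the flag arrives in updateDbScore depends on
how the CrawlDb reducer merges its inputs, hence the lookup in both
places:

  import java.util.Collection;
  import java.util.List;
  import java.util.Map.Entry;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseData;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.scoring.AbstractScoringFilter;

  public class RefetchControlScoringFilter extends AbstractScoringFilter {

    private static final Text SHOULD_REFETCH_AND_INDEX =
        new Text("shouldRefetchAndIndex");

    @Override
    public void passScoreAfterParsing(Text url, Content content, Parse parse) {
      // record the analysis result in the parse metadata, which travels
      // with the segment into the following steps
      boolean refetch = looksWorthRefetching(parse.getText());
      parse.getData().getParseMeta()
          .set(SHOULD_REFETCH_AND_INDEX.toString(), Boolean.toString(refetch));
    }

    @Override
    public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
        ParseData parseData, Collection<Entry<Text, CrawlDatum>> targets,
        CrawlDatum adjust, int allCount) {
      // copy the flag from the parse metadata into the "adjust" datum,
      // which updatedb merges into the CrawlDatum of the page itself
      String flag =
          parseData.getParseMeta().get(SHOULD_REFETCH_AND_INDEX.toString());
      if (flag != null) {
        if (adjust == null) {
          adjust = new CrawlDatum(CrawlDatum.STATUS_LINKED, 0);
        }
        adjust.getMetaData().put(SHOULD_REFETCH_AND_INDEX, new Text(flag));
      }
      return adjust;
    }

    @Override
    public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
        List<CrawlDatum> inlinked) {
      // look for the flag on the datum itself and on the datums
      // collected from the parse step
      Writable flag = datum.getMetaData().get(SHOULD_REFETCH_AND_INDEX);
      for (int i = 0; flag == null && i < inlinked.size(); i++) {
        flag = inlinked.get(i).getMetaData().get(SHOULD_REFETCH_AND_INDEX);
      }
      if (flag != null) {
        // persist the flag and, if the page should not be refetched,
        // push the next fetch far into the future
        datum.getMetaData().put(SHOULD_REFETCH_AND_INDEX,
            new Text(flag.toString()));
        if ("false".equals(flag.toString())) {
          datum.setFetchInterval(Integer.MAX_VALUE);
        }
      }
    }

    // placeholder for the actual text content analysis
    private boolean looksWorthRefetching(String text) {
      return text != null && !text.contains("ARCHIVED");
    }
  }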


   > As the documentation is very modest,

I agree. The wiki page [1] certainly needs an overhaul.

Best,
Sebastian


[1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring


On 12/10/24 12:15, Maciek Puzianowski wrote:
Hi,
I am trying to make a Nutch plugin.
I was wondering if it is possible to mark URLs based on the content of a
fetched page.
The concept behind it is to prevent a given URL from being refetched in
the future, based on text content analysis.

What I have tried so far is extending ScoringFilter, keeping URLs in a
HashSet defined in my ScoringFilter, and then updating the CrawlDatum in
updateDbScore, but it seems that the HashSet is not persistent throughout
the parsing and scoring process.

As the documentation is very modest, I would like to ask the community
what I can do about this problem.

Kind regards
Maciek