Sure I will, but before that I thought I'd show you my piece of code.
Perhaps there is something I didn't catch earlier.

public void passScoreAfterParsing(Text url, Content content, Parse parse)
        throws ScoringFilterException {
    // Randomly flag pages, just to exercise both branches.
    if (new Random().nextBoolean()) {
        parse.getData().getParseMeta().set(SHOULD_REFETCH_AND_INDEX, "false");
    } else {
        parse.getData().getParseMeta().set(SHOULD_REFETCH_AND_INDEX, "true");
    }
}

public CrawlDatum distributeScoreToOutlinks(Text urlWr, ParseData parseData,
        Collection<Map.Entry<Text, CrawlDatum>> collection, CrawlDatum adjust,
        int i) throws ScoringFilterException {
    adjust = new CrawlDatum();
    int fetchStatus =
            Integer.parseInt(parseData.getParseMeta().get(Nutch.FETCH_STATUS_KEY));
    adjust.setStatus(fetchStatus);

    if (parseData.getParseMeta().get(SHOULD_REFETCH_AND_INDEX) != null) {
        boolean flag =
                parseData.getParseMeta().get(SHOULD_REFETCH_AND_INDEX).equals("true");
        adjust.getMetaData().put(new Text(SHOULD_REFETCH_AND_INDEX),
                new Text(Boolean.toString(flag)));
    }

    return adjust;
}

public void updateDbScore(Text urlText, CrawlDatum old, CrawlDatum datum,
        List<CrawlDatum> list) throws ScoringFilterException {
    Text oldTag = (Text) old.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));
    Text datumTag = (Text) datum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));
    if (oldTag != null && oldTag.toString().equals("false")) {
        datum.setFetchInterval(Integer.MAX_VALUE);
    }
    if (datumTag != null && datumTag.toString().equals("false")) {
        datum.setFetchInterval(Integer.MAX_VALUE);
    }
}

That is the code that sets the CrawlDatum metadata. After that, in my
IndexingFilter, I get an NPE when I try to read that flag from the CrawlDatum
metadata.

Text shouldIndex =
        (Text) crawlDatum.getMetaData().get(new Text(SHOULD_REFETCH_AND_INDEX));

The line above sometimes gives me an NPE even though all of the CrawlDatum
entries have that tag in their metadata.
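Until the root cause is found, a null guard on the read side at least avoids
the crash. The sketch below models the metadata with a plain java.util.Map
(the real code reads crawlDatum.getMetaData(), a MapWritable keyed by Text);
the idea is just that a missing tag defaults to "refetch and index", which is
what an unfetched entry with no metadata should mean anyway:

```java
import java.util.HashMap;
import java.util.Map;

public class TagGuard {
    static final String SHOULD_REFETCH_AND_INDEX = "shouldRefetchAndIndex";

    // Null-safe read: only an explicit "false" suppresses indexing;
    // a missing tag behaves like "true".
    static boolean shouldIndex(Map<String, String> metaData) {
        String tag = metaData.get(SHOULD_REFETCH_AND_INDEX);
        return tag == null || !tag.equals("false");
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new HashMap<>();
        fetched.put(SHOULD_REFETCH_AND_INDEX, "false");
        Map<String, String> unfetched = new HashMap<>(); // metadata never set

        System.out.println(shouldIndex(fetched));   // prints false
        System.out.println(shouldIndex(unfetched)); // prints true
    }
}
```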

If you can, please let me know if you see anything in my solution that could
cause this problem.

Best,
Maciek

Hi Maciek,
>
>  > Unless I do some silly mistake, I think it is worth a Jira issue.
>
> sure. Please open an issue. Of course, would be great if it is
> reproducible,
> ideally without any custom plugins.
>
> Best,
> Sebastian
>
>
> On 12/13/24 09:31, Maciek Puzianowski wrote:
> > In the first 2-3 crawls there are URLs in CrawlDb with either db_fetched
> or
> > db_unfetched status. Only db_fetched urls go through IndexingFilter and
> > even though all of these URLs have my custom tag in their metadata
> > (explored in dumped crawldb) - some of them kind of randomly get NPE when
> > trying to reach that tag.
> > However, it is not that bad, because I can pass my tag to IndexingFilter
> > via ParseData and then there is no problem.
> >
> > Unless I do some silly mistake, I think it is worth a Jira issue.
> >
> > Best,
> > Maciek
> >
> > pt., 13 gru 2024 o 00:23 Sebastian Nagel <[email protected]
> .invalid>
> > napisał(a):
> >
> >> Hi Maciek,
> >>
> >>   > However, sometimes this gets me a NullPointerException and it is
> >>   > kind of weird to me, because I have double checked and dumped
> CrawlDb
> >> and
> >>   > these URLs have this tag in metadata.
> >>
> >> Are there other URLs / items in the CrawlDb as well? I'd especially look
> >> at
> >> unfetched ones, as these may not have the Metadata initialized.
> >>
> >> Otherwise difficult to tell...
> >>
> >> Best,
> >> Sebastian
> >>
> >> On 12/12/24 13:06, Maciek Puzianowski wrote:
> >>> Sebastian,
> >>> thank you very much for your help. I have come to the exact same
> solution
> >>> following Nutch code and it works like a charm. Although, I have one
> more
> >>> concern. I wanted to use CrawlDatum metadata that I have updated in
> >>> distributeScoreToOutlinks method in IndexFilter. I would do (in
> >> IndexFilter)
> >>>
> >>> Text shouldIndex = (Text) crawlDatum.getMetaData().get(new
> >>> Text(SHOULD_REFETCH_AND_INDEX));
> >>>
> >>> that is the metadata tag that I have put in distributeScoreToOutlinks
> >>> method. However, sometimes this gets me a NullPointerException and it
> is
> >>> kind of weird to me, because I have double checked and dumped CrawlDb
> and
> >>> these URLs have this tag in metadata.
> >>> Any hints on that?
> >>>
> >>> Again, thank you Sebastian for your response.
> >>>
> >>> Best regards
> >>> Maciek
> >>>
> >>> czw., 12 gru 2024 o 10:42 Sebastian Nagel
> >>> <[email protected]> napisał(a):
> >>>
> >>>> Hi Maciek,
> >>>>
> >>>>    > The concept behind it is to prevent given URL from refetching in
> the
> >>>> future
> >>>>    > based on text content analysis.
> >>>>
> >>>>    > extending ScoringFilter
> >>>>
> >>>> Yes, it's the right plugin type to implement such a feature.
> >>>>
> >>>>    > keeping urls in a HashSet defined in my ScoringFilter and then
> >> updating
> >>>>    > CrawlDatum in updateDbScore, but it seems that the HashSet is not
> >>>> persistent
> >>>>    > throughout parsing and scoring process.
> >>>>
> >>>> Indeed. Everything which should be persistent needs to be stored in
> >> Nutch
> >>>> data structures. Assumed the "text content analysis" is done during
> the
> >>>> parsing, the flag or score needs to be passed forward via
> >>>>     - passScoreAfterParsing
> >>>>     - distributeScoreToOutlinks
> >>>>       (in addition to passing stuff to outlinks but you can "adjust"
> the
> >>>>        CrawlDatum of the page being processed)
> >>>>     - updateDbScore
> >>>>       - here you would modify the next fetch time of the
> >>>>         page, eventually also the retry interval
> >>>>       - if necessary you can store additional information in the
> >> CrawlDatum's
> >>>>         metadata
> >>>>
> >>>>
> >>>>    > As the documentation is very modest,
> >>>>
> >>>> I agree. The wiki page [1] needs for sure an overhaul.
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>>
> >>>> [1] https://cwiki.apache.org/confluence/display/nutch/NutchScoring
> >>>>
> >>>>
> >>>> On 12/10/24 12:15, Maciek Puzianowski wrote:
> >>>>> Hi,
> >>>>> I am trying to make a Nutch plugin.
> >>>>> I was wondering if it is possible to mark URLs based on content of a
> >>>>> fetched page.
> >>>>> The concept behind it is to prevent given URL from refetching in the
> >>>> future
> >>>>> based on text content analysis.
> >>>>>
> >>>>> What I have tried so far is extending ScoringFilter and keeping urls
> >> in a
> >>>>> HashSet defined in my ScoringFilter and then updating CrawlDatum in
> >>>>> updateDbScore, but it seems that the HashSet is not persistent
> >> throughout
> >>>>> parsing and scoring process.
> >>>>>
> >>>>> As the documentation is very modest, I would like to ask community
> >> about
> >>>>> what can I do with this problem.
> >>>>>
> >>>>> Kind regards
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >
>
>
