Hi, just use the bin/nutch invertlinks tool to create a link database
(LinkDb), which you can then read using the readlinkdb tool.
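
A minimal sketch of those two steps, assuming your segments live under
crawl/segments and the LinkDb should go to crawl/linkdb (both paths are
assumptions, as is the find_inlinks helper, which is not part of Nutch):

```shell
# Assumed layout: crawl/segments holds the fetched segments and
# crawl/linkdb is where the inverted link database gets written.
NUTCH="${NUTCH:-bin/nutch}"
SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"
LINKDB="${LINKDB:-crawl/linkdb}"

# find_inlinks is a hypothetical helper, not something Nutch ships.
find_inlinks() {
  # Build (or update) the LinkDb from all segments under SEGMENTS_DIR.
  "$NUTCH" invertlinks "$LINKDB" -dir "$SEGMENTS_DIR" || return 1
  # Print the inlinks (source URL and anchor text) recorded for the URL.
  "$NUTCH" readlinkdb "$LINKDB" -url "$1"
}
```

You would then call it as
find_inlinks "smb://host/../Folder1/Folder2/Folder3/Filename.extension".
To inspect everything instead of one URL, readlinkdb can also write the
whole database as text with -dump <out_dir>. Note that invertlinks applies
the URL normalizers and filters by default, so the odd URL may not show up
under its raw form; as far as I remember the -noNormalize and -noFilter
options switch that off.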

On Mon, 7 Oct 2024 at 13:03, Hiran Chaudhuri
<[email protected]> wrote:

> Hello Markus,
>
> Thank you for that answer. I am not familiar with 'invertlinks'.
>
> Where can I get more information on how to run it, and maybe on how to
> inspect the results?
>
> Hiran
>
>
> On 07.10.24 11:59, Markus Jelsma wrote:
> > Well, you could use the invertlinks tool to find out which URL links
> > to that one. If you find it, you should be able to reproduce that
> > link using the parsechecker tool, if the page still links to it.
> >
> > Do you have any URL normalizer plugin active? They should deal with
> > relative paths, I think.
> >
> > On Mon, 7 Oct 2024 at 11:51, Hiran Chaudhuri
> > <[email protected]> wrote:
> >
> >> While testing my protocol plugin I suddenly noticed a URL that just
> >> cannot get fetched. It looks like
> >>
> >> smb://host/../Folder1/Folder2/Folder3/Filename.extension
> >>
> >> Obviously there is a problem here as no SMB server would ever offer a
> >> share named '..'.
> >> Hence I'd like to know where this link came from.
> >>
> >> If it did not come through the protocol-plugin it must come from a
> >> parser or other source.
> >> What are the ways to discover that? Is there data in the CrawlDB that I
> >> am not yet aware of?
> >>
> >>
>
