Hi, just use the bin/nutch invertlinks tool to create a new database (the LinkDb), which you can then read using the readlinkdb tool.
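
A rough sketch of the two steps, in case it helps. The directory layout (crawl/segments, crawl/linkdb, crawl/linkdb_dump) is only an assumption, so adjust the paths to your crawl. The small `nutch` wrapper function is hypothetical too: it prints the command instead of running it when bin/nutch is not found in the current directory.

```shell
#!/bin/sh
# Assumed (hypothetical) layout: crawl/segments holds the fetched segments,
# crawl/linkdb is where the inverted links get written.
CRAWL_DIR="${CRAWL_DIR:-crawl}"
LINKDB="$CRAWL_DIR/linkdb"
TARGET_URL='smb://host/../Folder1/Folder2/Folder3/Filename.extension'

# Hypothetical helper: run bin/nutch if present, otherwise print the command.
nutch() {
  if [ -x bin/nutch ]; then
    bin/nutch "$@"
  else
    echo "bin/nutch $*"
  fi
}

# 1. Build (or update) the LinkDb from all segments.
nutch invertlinks "$LINKDB" -dir "$CRAWL_DIR/segments"

# 2. Show the inlinks recorded for the suspicious URL.
nutch readlinkdb "$LINKDB" -url "$TARGET_URL"

# 3. Alternatively, dump the whole LinkDb as text and search through it.
nutch readlinkdb "$LINKDB" -dump "$CRAWL_DIR/linkdb_dump"
```

invertlinks applies the configured URL normalizers and filters unless you pass -noNormalize and -noFilter, so the URL may show up in the LinkDb in a different form than in the CrawlDb. If the -url lookup finds nothing, the text dump is probably the easier place to search.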
On Mon, 7 Oct 2024 at 13:03, Hiran Chaudhuri <[email protected]> wrote:

> Hello Markus,
>
> Thank you for that answer. I am not familiar with 'invertlinks'.
>
> Where can I get more information on how to run it and maybe how to
> inspect the results?
>
> Hiran
>
>
> On 07.10.24 11:59, Markus Jelsma wrote:
> > Well, you could use the invertlinks tool to find out which URL
> > links to that one. If you find it, you should be able to reproduce that
> > link using the parsechecker tool, if it still links to it.
> >
> > Do you have any URL normalizer plugin active? They should deal with
> > relative paths, I think.
> >
> > On Mon, 7 Oct 2024 at 11:51, Hiran Chaudhuri
> > <[email protected]> wrote:
> >
> >> While testing my protocol plugin I suddenly noticed a URL that just
> >> cannot get fetched. It looks like
> >>
> >> smb://host/../Folder1/Folder2/Folder3/Filename.extension
> >>
> >> Obviously there is a problem here as no SMB server would ever offer a
> >> share named '..'.
> >> Hence I'd like to know where this link came from.
> >>
> >> If it did not come through the protocol-plugin it must come from a
> >> parser or other source.
> >> What are the ways to discover that? Is there data in the CrawlDB that I
> >> am not yet aware of?

