Ok, I found the offending file by pure luck.

And I found both tools integrated into the nutch executable. So I ran

./runtime/local/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
./runtime/local/bin/nutch readlinkdb crawl/linkdb -dump dumplinks

In the file dumplinks/part-r-00000 I see one link (I reduced the crawl
to the offending file), but it is not the one I was expecting. The
offending file has several outgoing links, yet the bad one is not in the
dump. On the other hand, I can find the bad link using grep both in the
crawldb and in the segments:

$ grep -R "smb://host/../folder" crawl
grep: crawl/crawldb/old/part-r-00000/data: binary file matches
grep: crawl/crawldb/current/part-r-00000/data: binary file matches
grep: crawl/segments/20241007131754/crawl_fetch/part-r-00000/data: binary file matches
grep: crawl/segments/20241007131754/crawl_generate/part-r-00000: binary file matches
grep: crawl/segments/20241007131655/crawl_parse/part-r-00000: binary file matches
$
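
Since grep only reports "binary file matches", it does not show what surrounds the URL inside those files. Below is a minimal sketch, not part of Nutch, that walks a directory, finds a byte pattern in each file and prints the printable text around every hit. The function name find_in_binary and the context width are made up for illustration. It assumes the data is stored uncompressed at the match (grep finding the string suggests so) and that the files are small enough to read into memory. Nutch's own readdb and readseg tools remain the proper way to inspect these files.

```python
import os


def find_in_binary(root, needle, context=80):
    """Yield (path, snippet) for every occurrence of needle in files under root.

    Hypothetical helper for illustration; it knows nothing about the
    Hadoop file formats and simply treats every file as raw bytes.
    """
    needle_bytes = needle.encode("utf-8")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                data = handle.read()  # assumption: small test crawl
            pos = data.find(needle_bytes)
            while pos != -1:
                start = max(0, pos - context)
                end = min(len(data), pos + len(needle_bytes) + context)
                # Show printable ASCII as-is, everything else as '.'
                snippet = "".join(
                    chr(b) if 32 <= b < 127 else "." for b in data[start:end]
                )
                yield path, snippet
                pos = data.find(needle_bytes, pos + 1)


if __name__ == "__main__":
    for path, snippet in find_in_binary("crawl", "smb://host/../"):
        print(path)
        print("    " + snippet)
```

The text right before a hit in crawl_parse may hint at the page the outlink was recorded for, but that is a guess about the record layout and not something this sketch verifies.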

Something is still fishy.


On 07.10.24 13:11, Markus Jelsma wrote:
Hi - Just use the bin/nutch invertlinks tool to create a new database,
which you can then read using the readlinkdb tool.

On Mon, 7 Oct 2024 at 13:03, Hiran Chaudhuri
<[email protected]> wrote:

Hello Markus,

Thank you for that answer. I am not familiar with 'invertlinks'.

Where can I get more information, such as how to run it and maybe how to
inspect the results?

Hiran


On 07.10.24 11:59, Markus Jelsma wrote:
Well, you could use the invertlinks tool to find out which URL links to
that one. If you find it, you should be able to reproduce that link
using the parsechecker tool, if the page still links to it.

Do you have any URL normalizer plugin active? They should deal with
relative paths, I think.
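
To illustrate what "dealing with relative paths" means for a URL like the one in question, here is a minimal sketch of dot-segment removal in the spirit of RFC 3986. It is not the code of any Nutch normalizer plugin; the names remove_dot_segments and normalize_url are invented for this example, and it only handles absolute paths as they appear after the host part.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Collapse '.' and '..' segments of an absolute URL path.

    Illustrative helper, not Nutch code. A '..' that would climb above
    the root is dropped, which is how '/../Folder1' becomes '/Folder1'.
    """
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # Keep the leading empty segment that represents the root.
            if len(out) > 1:
                out.pop()
        elif seg != ".":
            out.append(seg)
    # A trailing '.' or '..' refers to a directory: keep the final slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def normalize_url(url):
    """Return url with the dot segments removed from its path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


if __name__ == "__main__":
    print(normalize_url("smb://host/../Folder1/Folder2/Folder3/Filename.extension"))
    # prints smb://host/Folder1/Folder2/Folder3/Filename.extension
```

Whether the cleaned-up URL points at a real share is a separate question: dropping the '..' only makes the URL well-formed, it does not tell where the wrong link came from.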

On Mon, 7 Oct 2024 at 11:51, Hiran Chaudhuri
<[email protected]> wrote:

While testing my protocol plugin I suddenly noticed a URL that just
cannot be fetched. It looks like

smb://host/../Folder1/Folder2/Folder3/Filename.extension

Obviously there is a problem here as no SMB server would ever offer a
share named '..'.
Hence I'd like to know where this link came from.

If it did not come through the protocol-plugin it must come from a
parser or other source.
What are the ways to discover that? Is there data in the CrawlDB that I
am not yet aware of?

