I tried to follow this fine tutorial, which not only teaches how to
crawl the seed URLs but also shows how to check the content in the
database. Chapter 3.5 even explains how to create the inverted link
database. Unfortunately this does not work out for me. I do not see
any data. Oh, and there is this additional note:
/Note that this database of “inlinks” only includes cross-domain links,
so it will only contain any links from one page to another that come
from a different domain./
I am scanning one URL only, so obviously the other URLs all come from
the same domain. I am left with the same question:
*How can I find out why a specific URL is being fetched?*
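(A side note for others reading along: whether same-domain links end up
in the linkdb appears to be governed by a configuration property. A
sketch for conf/nutch-site.xml, hedged because the name varies by
version; older 1.x releases use db.ignore.internal.links, newer ones
also know linkdb.ignore.internal.links:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>

With that set to false, invertlinks should record intra-domain links as
well.)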
On 07.10.24 13:36, Hiran Chaudhuri wrote:
Ok, I found the offending file by pure luck.
And I found both tools integrated into the nutch executable. So I ran
./runtime/local/bin/nutch invertlinks crawl/linkdb -dir crawl/segments
./runtime/local/bin/nutch readlinkdb crawl/linkdb -dump dumplinks
In the file dumplinks/part-r-00000 I see one link (I reduced the crawl
to the offending file), but it is not the one I was expecting. Well,
that file has several outgoing links, yet the bad one is not among them?
On the other hand I can find the bad link using grep both in crawldb and
in the segments:
$ grep -R "smb://host/../folder" crawl
grep: crawl/crawldb/old/part-r-00000/data: binary file matches
grep: crawl/crawldb/current/part-r-00000/data: binary file matches
grep: crawl/segments/20241007131754/crawl_fetch/part-r-00000/data: binary file matches
grep: crawl/segments/20241007131754/crawl_generate/part-r-00000: binary file matches
grep: crawl/segments/20241007131655/crawl_parse/part-r-00000: binary file matches
$
Something is still fishy.
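Since grep finds the link inside the binary segment data, the readseg
tool should be able to dump those records in readable form. A sketch,
reusing the segment name from the grep output above and keeping only
the parse output:

./runtime/local/bin/nutch readseg -dump crawl/segments/20241007131655 dumpseg -nocontent -nofetch -nogenerate -noparsetext
grep -B5 "smb://host/../folder" dumpseg/dump

The parse data lists each page's outlinks, so the lines around the
match should reveal the page the bad link was extracted from.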
On 07.10.24 13:11, Markus Jelsma wrote:
Hi - Just use the bin/nutch invertlinks tool to create a new database,
which you can then read using the readlinkdb tool.
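If only one URL is of interest, readlinkdb can also look it up directly
instead of dumping the whole database, e.g. (paths assumed from the
standard crawl layout):

bin/nutch readlinkdb crawl/linkdb -url smb://host/../Folder1/Folder2/Folder3/Filename.extension

which prints the inlinks recorded for that URL, if any.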
On Mon, 7 Oct 2024 at 13:03, Hiran Chaudhuri
<[email protected]> wrote:
Hello Markus,
Thank you for that answer. I am not familiar with 'invertlinks'.
Where can I get more information, such as how to run it and how to
inspect the results?
Hiran
On 07.10.24 11:59, Markus Jelsma wrote:
Well, you could use the invertlinks tool to find out which URL links
to that one. If you find it, you should be able to reproduce that
link using the parserchecker tool, if it still links to it.
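For example, if you suspect a particular page, something along these
lines (the URL is a placeholder) parses it live and prints the outlinks
Nutch extracts:

bin/nutch parsechecker smb://host/Folder1/Folder2/

If the malformed link shows up in that output, the parser (or a missing
normalization) is producing it.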
Do you have any URL normalizer plugin active? They should deal with
relative paths, I think.
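For the case at hand, a hypothetical rule for conf/regex-normalize.xml
(read by the urlnormalizer-regex plugin) could collapse a '..' that
directly follows the host:

<regex>
  <pattern>^(smb://[^/]+)/\.\./</pattern>
  <substitution>$1/</substitution>
</regex>

Only a sketch; the basic normalizer may already resolve such paths for
http URLs, but a custom protocol might need an explicit rule.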
On Mon, 7 Oct 2024 at 11:51, Hiran Chaudhuri
<[email protected]> wrote:
While testing my protocol plugin I suddenly noticed a URL that just
cannot be fetched. It looks like
smb://host/../Folder1/Folder2/Folder3/Filename.extension
Obviously there is a problem here as no SMB server would ever offer a
share named '..'.
Hence I'd like to know where this link came from.
If it did not come through the protocol plugin, it must come from a
parser or some other source.
What are the ways to discover that? Is there data in the CrawlDB that
I am not yet aware of?
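For reference, individual entries can be inspected with the readdb
tool, e.g. (assuming the crawl lives in ./crawl):

bin/nutch readdb crawl/crawldb -url "smb://host/../Folder1/Folder2/Folder3/Filename.extension"

That prints the CrawlDatum for the URL (status, fetch time, metadata),
but as far as I can tell it does not record which page the link was
discovered on.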