Hi Peter,

First, you need the segments to get the link sources, so let's assume
they're still there and haven't been deleted...
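
To check quickly whether they are still there (using the layout from your
commands below):

  ls ${NUTCH_RUNTIME_HOME}/crawl/segments/

Each subdirectory (named by timestamp) is one segment.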

The simplest way is to create the LinkDb - possibly it is already there,
for example if the script bin/crawl was used without modifications.
If not, the LinkDb is created by

  bin/nutch invertlinks <linkdb> -dir <segmentsDir>
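
For example, with the layout from your commands below (the linkdb path is
just a suggestion, and I assume the segments live under crawl/segments):

  bin/nutch invertlinks ${NUTCH_RUNTIME_HOME}/crawl/linkdb -dir ${NUTCH_RUNTIME_HOME}/crawl/segments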

For every page / URL, the LinkDb contains all incoming links (source URL
and anchor text). This data structure makes it easy to look up the sources
of your 404s.

Please note that there are configuration properties controlling which kind
of links (internal/external) and how many links per page are included in
the LinkDb.
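
Depending on the Nutch version, these are properties such as
linkdb.ignore.internal.links / db.ignore.internal.links and db.max.inlinks
(please check the exact names and defaults in your nutch-default.xml).
Internal links are ignored by default, so if the 404s are linked from pages
within your own two domains, you may need to override this in
conf/nutch-site.xml before running invertlinks, for example:

  <property>
    <name>linkdb.ignore.internal.links</name>
    <value>false</value>
  </property>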

Once the LinkDb is created, just query or export it via

  bin/nutch readlinkdb ...

(similar to reading the CrawlDb)
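
For example, to look up which pages link to one of the gone URLs (linkdb
path as assumed above, the URL is just a placeholder):

  bin/nutch readlinkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb -url https://example.com/gone-page

or dump the whole LinkDb as text and grep it for your 120 gone URLs:

  bin/nutch readlinkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb -dump ./linkdbdump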

Best,
Sebastian

On 12/6/24 21:55, Peter Viskup wrote:
Just trying to figure out how to get the URIs from which the "gone" URIs
were linked.
ATM I have two domains crawled and indexed and was able to identify 120
gone links.

~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb
${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -stats|grep gone
2024-12-06 21:52:43,333 INFO o.a.n.c.CrawlDbReader [main] status 3
(db_gone):   120

generate CSV export
~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb
${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -dump ./dbdump -format csv

and then grep for "gone"
~$ grep gone dbdump/part-r-00000 |wc -l
120

So how to get the source URIs of those "gones"?

Peter

