Hi Peter,

First, you need the segments to get the link sources, so let's assume
they're still there and haven't been deleted...
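
To check quickly whether they are still there (using the layout from your
commands below):

  ls ${NUTCH_RUNTIME_HOME}/crawl/segments/

Each subdirectory (named by timestamp) is one segment.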

The simplest way is to create the LinkDb - possibly it is already there,
for example if the script bin/crawl was used without modifications.
If not, the LinkDb is created by

  bin/nutch invertlinks <linkdb> -dir <segmentsDir>
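
For example, with the layout from your commands below (the linkdb path is
just a suggestion, and I assume the segments live under crawl/segments):

  bin/nutch invertlinks ${NUTCH_RUNTIME_HOME}/crawl/linkdb -dir ${NUTCH_RUNTIME_HOME}/crawl/segments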

For every page / URL, the LinkDb contains all incoming links (source URL
and anchor text). This data structure makes it easy to look up the sources
of your 404s.

Please note that there are configuration properties controlling which kind
of links (internal/external) and how many links per page are included in
the LinkDb.
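
Depending on the Nutch version, these are properties such as
linkdb.ignore.internal.links / db.ignore.internal.links and db.max.inlinks
(please check the exact names and defaults in your nutch-default.xml).
Internal links are ignored by default, so if the 404s are linked from pages
within your own two domains, you may need to override this in
conf/nutch-site.xml before running invertlinks, for example:

  <property>
    <name>linkdb.ignore.internal.links</name>
    <value>false</value>
  </property>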

Once the LinkDb is created, just query or export it via

  bin/nutch readlinkdb ...

(similar to reading the CrawlDb)
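
For example, to look up which pages link to one of the gone URLs (linkdb
path as assumed above, the URL is just a placeholder):

  bin/nutch readlinkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb -url https://example.com/gone-page

or dump the whole LinkDb as text and grep it for your 120 gone URLs:

  bin/nutch readlinkdb ${NUTCH_RUNTIME_HOME}/crawl/linkdb -dump ./linkdbdump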

Best,
Sebastian

On 12/6/24 21:55, Peter Viskup wrote:
Just trying to figure out how to get the URIs from which the "gone" URIs
were linked.
ATM I have two domains crawled and indexed and was able to identify 120
gone links.

~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb
${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -stats|grep gone
2024-12-06 21:52:43,333 INFO o.a.n.c.CrawlDbReader [main] status 3
(db_gone):   120

generate CSV export
~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb
${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -dump ./dbdump -format csv

and then grep for "gone"
~$ grep gone dbdump/part-r-00000 |wc -l
120

So how to get the source URIs of those "gones"?

Peter

