I'm trying to figure out how to find the URIs that link to the "gone"
URIs.
At the moment I have two domains crawled and indexed, and I was able to
identify 120 gone links:
~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -stats | grep gone
2024-12-06 21:52:43,333 INFO o.a.n.c.CrawlDbReader [main] status 3 (db_gone): 120
Next I generate a CSV export:
~$ ${NUTCH_RUNTIME_HOME}/bin/nutch readdb ${NUTCH_RUNTIME_HOME}/crawl/segments/crawldb/ -dump ./dbdump -format csv
and then grep it for "gone":
~$ grep gone dbdump/part-r-00000 |wc -l
120
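A minimal sketch for reducing those matching rows to a plain list of URLs. It assumes that each CSV row starts with the double-quoted URL, that the status name db_gone appears in the row, and that the URLs contain no commas. gone_urls is a hypothetical helper name, not part of Nutch.

```shell
# Print the unique URLs of all db_gone rows in a crawldb CSV dump.
# Assumption: the URL is the first comma-separated field and is double-quoted.
# Limitation: a URL that itself contains a comma would be cut short.
gone_urls() {
    grep 'db_gone' "$1" | cut -d',' -f1 | tr -d '"' | sort -u
}
```

It could be called as gone_urls dbdump/part-r-00000 > gone-urls.txt to get one gone URL per line.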
So how do I get the source URIs of those "gone" links?
Peter