I'm seeing many cases where ParserChecker finds outlinks in a document, but
when I crawl that document they never appear in the crawldb (and are not
indexed).

My URL filters are trivial as far as I can tell, and the missing links are
not special in any way that I can see.

For example:

bin/nutch parsechecker -dumpText "http://corporate.exxonmobil.com/"

finds, among others, the URLs https://energyfactor.exxonmobil.com/ and
http://corporate.exxonmobil.com/en/investors/corporate-governance.

However, when running

bin/crawl urls_yossi yossi 2

with only http://corporate.exxonmobil.com/ in urls_yossi, and then dumping
yossi/crawldb (using `nutch readdb`), neither of the two URLs above is found.
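
For reference, this is roughly how I check the dump for those URLs. It is
only a sketch: it assumes each record in the text dump starts with the URL
followed by a tab, and the function name `missing_urls` and the sample text
are made up for illustration (they are not real crawl output).

```python
def missing_urls(dump_text, wanted):
    """Return the URLs from `wanted` that never start a record in the dump text."""
    # Assumption: each crawldb record begins with "<url>\t..." on its own line.
    seen = set()
    for line in dump_text.splitlines():
        if "\t" in line:
            # Keep only the part before the first tab, i.e. the candidate URL.
            seen.add(line.split("\t", 1)[0].strip())
    return [url for url in wanted if url not in seen]


# Made-up sample in the assumed layout; the record fields are a placeholder.
sample = "http://corporate.exxonmobil.com/\t(record fields)\n\n"

wanted = [
    "https://energyfactor.exxonmobil.com/",
    "http://corporate.exxonmobil.com/en/investors/corporate-governance",
]

# Lists every wanted URL that does not appear as a record key in the sample.
print(missing_urls(sample, wanted))
```

In practice the dump text would be read from the file(s) written by the
readdb dump instead of the inline sample.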

When the crawl finishes, the crawldb contains 786 entries, which is far below
topN.

Any idea what could be causing these URLs to be ignored?
