doubling score caused by page internal anchors.
-----------------------------------------------
Key: NUTCH-332
URL: http://issues.apache.org/jira/browse/NUTCH-332
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
Fix For: 0.8-dev
When a page has no outlinks to other pages but several links to itself (e.g. a
set of internal anchors), its score is distributed to those outlinks. But all
of these outlinks point back to the page itself, so the page's score is doubled.
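The doubling can be illustrated with a toy model. This is a minimal sketch, not
Nutch's actual scoring code: the class SelfLinkScoreSketch, the method
nextScore, and the equal-split rule are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the reported behaviour; NOT Nutch's real scoring code.
// All names in this class are hypothetical.
final class SelfLinkScoreSketch {

    /**
     * Splits the page's score equally over its outlinks and adds every share
     * whose target is the page itself back onto the page's score.
     */
    public static double nextScore(String pageUrl, double score, List<String> outlinks) {
        if (outlinks.isEmpty()) {
            return score;
        }
        double share = score / outlinks.size();
        double received = 0.0;
        for (String target : outlinks) {
            // Strip the fragment: "page.html#a" refers to "page.html".
            int hash = target.indexOf('#');
            String normalized = hash >= 0 ? target.substring(0, hash) : target;
            if (normalized.equals(pageUrl)) {
                received += share;
            }
        }
        return score + received;
    }

    public static void main(String[] args) {
        String page = "http://example.com/page.html";
        List<String> anchors = Arrays.asList(page + "#a", page + "#b");
        double score = 1.0;
        for (int round = 1; round <= 3; round++) {
            score = nextScore(page, score, anchors);
            System.out.println("round " + round + ": " + score);
        }
    }
}
```

In this model a page whose outlinks are all anchors on itself goes from 1.0 to
2.0, 4.0 and 8.0 over three update rounds, while a page linking only to other
pages keeps its score.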
I'm not sure, but this may also cause a never-ending fetch loop for this page,
since outlinks with the status CrawlDatum.STATUS_LINKED are set to
CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
So the status "fetched" may be overwritten with "unfetched".
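The suspected status overwrite can be sketched with a toy model. The enum
Status, both merge methods, and the merge rules below are hypothetical
stand-ins, not the actual CrawlDBReducer logic, and the guarded variant is only
one possible way to avoid the overwrite.

```java
// Toy model of the suspected status overwrite; NOT the real CrawlDBReducer.
final class StatusMergeSketch {

    // Hypothetical stand-ins for the CrawlDatum status constants.
    enum Status { DB_UNFETCHED, DB_FETCHED, LINKED }

    // Suspected behaviour: an incoming LINKED entry always becomes
    // DB_UNFETCHED, even if the page is already recorded as fetched.
    public static Status mergeNaive(Status existing, Status incoming) {
        if (incoming == Status.LINKED) {
            return Status.DB_UNFETCHED;
        }
        return incoming;
    }

    // Guarded variant (an assumption, not a confirmed fix): a LINKED entry
    // only creates a DB_UNFETCHED record when the page is not known yet,
    // which is modelled here as existing == null.
    public static Status mergeGuarded(Status existing, Status incoming) {
        if (incoming == Status.LINKED) {
            return existing != null ? existing : Status.DB_UNFETCHED;
        }
        return incoming;
    }

    public static void main(String[] args) {
        // A fetched page that links to itself.
        System.out.println(mergeNaive(Status.DB_FETCHED, Status.LINKED));   // DB_UNFETCHED
        System.out.println(mergeGuarded(Status.DB_FETCHED, Status.LINKED)); // DB_FETCHED
    }
}
```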
In that case we would fetch the page again every time and also double its
score every time, which leads to very high scores for no reason.
--
This message is automatically generated by JIRA.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers