[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]
Sami Siren updated NUTCH-332:
-----------------------------
Fix Version/s: 0.9
(was: 0.8)
> doubling score caused by page internal anchors.
> -----------------------------------------------
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Stefan Groschupf
> Priority: Blocker
> Fix For: 0.9
>
> Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks to other pages but several links to itself (e.g.
> a set of page-internal anchors), its score is distributed among those
> outlinks. Because all of these outlinks point back to the page itself, the
> page's score is doubled.
> I'm not sure, but this may also cause a never-ending fetch loop for this
> page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to
> CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer, line 107.
> So the fetched status may be overwritten with unfetched.
> In that case we fetch the page again on every cycle and double its score
> each time, which leads to very high scores for no reason.
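
The arithmetic behind the report can be illustrated with a small, self-contained
sketch. This is a toy model, not Nutch code and not the attached patch: the
class ScoreDoublingSketch, the methods updatedScore and stripFragment, and the
skipSelfLinks flag are all hypothetical names, and the model assumes that a
page's score is split evenly among its outlinks and that each share is added to
the score of the link target. Under those assumptions, a page whose only
outlinks are anchors to itself gets its whole score back on top of what it
already had.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, not the Nutch API.
public class ScoreDoublingSketch {

    // Returns the page's score after one update cycle in this simplified model.
    // Each outlink receives score / outlinks.size(); shares sent to the page
    // itself are added back onto the page's own score.
    static float updatedScore(String pageUrl, float score,
                              List<String> outlinks, boolean skipSelfLinks) {
        if (outlinks.isEmpty()) {
            return score;
        }
        float share = score / outlinks.size();
        float incoming = 0.0f;
        for (String target : outlinks) {
            boolean selfLink = stripFragment(target).equals(pageUrl);
            if (selfLink && !skipSelfLinks) {
                incoming += share; // the page "votes" for itself
            }
            // Shares for other pages leave this page and are ignored here.
        }
        return score + incoming;
    }

    // Drops the "#anchor" part so page-internal anchors map to the page URL.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        String page = "http://example.com/page.html";
        List<String> anchors = Arrays.asList(
                page + "#top", page + "#intro", page + "#details", page + "#end");

        // All four outlinks point back to the page: the score doubles.
        System.out.println(updatedScore(page, 1.0f, anchors, false));
        // Ignoring self-links leaves the score unchanged.
        System.out.println(updatedScore(page, 1.0f, anchors, true));
    }
}
```

In this model, ignoring links whose target (without the fragment) equals the
source URL is one possible way to avoid the self-contribution; whether that is
the right place to fix it in Nutch is a separate question.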
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira