[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Stefan Groschupf updated NUTCH-332:
-----------------------------------

    Attachment: scoreDoubling.patch

A patch to solve this problem. This is an example page: http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors that cause the problem. What happens is: foo.com/a.html points to foo.com/a.html#chapter1; we normalize foo.com/a.html#chapter1 to foo.com/a.html, so foo.com/a.html contributes all of its score back to foo.com/a.html.

> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8-dev
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.8-dev
>
>         Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks but several links to itself (e.g. it has a set of anchors), its score is distributed to its outlinks. But all of these outlinks point back to the page itself, which causes the page score to be doubled.
> I'm not sure, but this may also cause a never-ending fetching loop for the page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107.
> So the status "fetched" may be overwritten with "unfetched". In such a case we fetch the page again every time and also double its score every time, which leads to very high scores without any reason.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
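The mechanism described above can be sketched in a few lines. This is a minimal illustration, not Nutch's actual normalizer or scoring code: it assumes a normalizer that simply strips the URL fragment, and a scoring step that splits a page's score evenly across its outlinks. With only in-page anchors as outlinks, every normalized outlink equals the page itself, so the full score flows back and the page's score doubles.

```java
import java.util.List;

public class SelfLinkScoreDemo {
    // Strip the fragment, mirroring what a URL normalizer does to in-page anchors.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    // Sum the score that flows back to the page from outlinks that
    // normalize to the page itself (score split evenly across outlinks).
    static float contribution(String page, List<String> outlinks, float score) {
        float back = 0f;
        for (String link : outlinks) {
            if (normalize(link).equals(page)) {
                back += score / outlinks.size();
            }
        }
        return back;
    }

    public static void main(String[] args) {
        String page = "http://foo.com/a.html";
        // The page's only "outlinks" are its own internal anchors.
        List<String> outlinks = List.of(
            "http://foo.com/a.html#chapter1",
            "http://foo.com/a.html#chapter2");
        float score = 1.0f;
        // Every outlink normalizes back to the page, so the entire score
        // returns and is added on top of the existing score.
        float newScore = score + contribution(page, outlinks, score);
        System.out.println(newScore); // 2.0
    }
}
```

Repeating this on every crawl-update cycle is what produces the unbounded score growth the report warns about.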
[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.
doubling score causes by page internal anchors.
-----------------------------------------------

                 Key: NUTCH-332
                 URL: http://issues.apache.org/jira/browse/NUTCH-332
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8-dev
            Reporter: Stefan Groschupf
            Priority: Blocker
             Fix For: 0.8-dev


When a page has no outlinks but several links to itself (e.g. it has a set of anchors), its score is distributed to its outlinks. But all of these outlinks point back to the page itself, which causes the page score to be doubled.
I'm not sure, but this may also cause a never-ending fetching loop for the page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107.
So the status "fetched" may be overwritten with "unfetched". In such a case we fetch the page again every time and also double its score every time, which leads to very high scores without any reason.
[jira] Created: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
----------------------------------------------------------------------------------

                 Key: NUTCH-331
                 URL: http://issues.apache.org/jira/browse/NUTCH-331
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8-dev, 0.9-dev
            Reporter: Andrzej Bialecki
            Priority: Critical
             Fix For: 0.8-dev, 0.9-dev


Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist. These threads may block for a long time after being started and after reading their input fetchlist entries, due to "politeness" settings. However, the map-reduce framework considers a task complete when all of its input data has been read. This causes the tasktracker to incorrectly assume that task processing is complete (because the task progress is 1.0, since all input has been consumed), whereas many URLs from the fetchlist may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent this problem becomes, because the final number of fetched pages may fall short of the target by as many as (numThreads * numMapTasks) entries. The end result is that only part of the fetchlist is fetched, because Fetcher map tasks are stopped when their progress reaches 1.0.
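The gap between "input consumed" and "work done" can be sketched as follows. This is a simplified illustration, not Nutch's or Hadoop's actual code, and the method names are hypothetical: it contrasts progress computed from fetchlist entries read (which reaches 1.0 while politeness-blocked threads still hold unfetched URLs) with progress computed from entries actually fetched.

```java
public class FetcherProgress {
    // Progress as the map-reduce framework sees it: fraction of input read.
    // This hits 1.0 as soon as the fetchlist is consumed, even though
    // FetcherThreads may still be blocked on politeness delays.
    static float naiveProgress(int entriesRead, int totalEntries) {
        return (float) entriesRead / totalEntries;
    }

    // A fetch-aware alternative: count an entry as done only once it has
    // actually been fetched, so the task is not considered complete while
    // URLs are still queued in blocked threads.
    static float fetchAwareProgress(int entriesFetched, int totalEntries) {
        return (float) entriesFetched / totalEntries;
    }

    public static void main(String[] args) {
        int total = 100;
        int read = 100;    // whole fetchlist has been handed to threads
        int fetched = 60;  // but 40 URLs are still waiting in blocked threads
        System.out.println(naiveProgress(read, total));       // 1.0 -> task looks done
        System.out.println(fetchAwareProgress(fetched, total)); // 0.6 -> task kept alive
    }
}
```

Under the naive scheme, stopping the task at progress 1.0 silently drops the 40 pending URLs, which is exactly the (numThreads * numMapTasks) shortfall described above.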
Re: 0.8 release
No objections from me. We have waited long enough, and we can fix things in a maintenance release in a few weeks.

Regards,
Piotr

On 7/26/06, Sami Siren <[EMAIL PROTECTED]> wrote:
Andrzej Bialecki wrote:
> Sami Siren wrote:
>
>> There is a package available for testing at
>> http://people.apache.org/~siren/nutch-0.8/
>>
>> Please give it some testing and post your opinion - is it good
>> enough to be a public release?
>>
>> I have some doubts because of NUTCH-266, but so far only 3 people
>> have reported this to be a problem
>> (me included)
>
>
> This is, I guess, related to a very specific environment - multiple
> nodes running on Cygwin. Usually people run multiple nodes on some
> flavor of Unix.
>
> I don't have any means to test it for this issue ...
>
The bug also appears in a single-node configuration, but I think it is
not that common (guessing from the number of people who have reported
it). However, that is now fixed in hadoop trunk. Should we use a patched
version of hadoop-0.4.0 in Nutch or wait for 0.5 (which at least still
seems to be 1.4 compatible)?

The 0.8 package has now hit the mirrors; does anybody have any objections
to announcing it? Stefan already commented on two issues he wished to be
fixed in 0.8, but to me it looks like they can both be addressed with
configuration changes and documentation in the first place, and there's
nothing stopping us from releasing 0.8.1 in a very short time to address
the issues discovered in 0.8.
--
Sami Siren