[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.
[ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Stefan Groschupf updated NUTCH-332:
-----------------------------------

    Attachment: scoreDoubling.patch

A patch to solve this problem. This is an example page: http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors that cause the problem. What happens is: foo.com/a.html points to foo.com/a.html#chapter1; we normalize foo.com/a.html#chapter1 to foo.com/a.html, so foo.com/a.html contributes all of its score back to foo.com/a.html.

> doubling score causes by page internal anchors.
> -----------------------------------------------
>
>                 Key: NUTCH-332
>                 URL: http://issues.apache.org/jira/browse/NUTCH-332
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8-dev
>            Reporter: Stefan Groschupf
>            Priority: Blocker
>             Fix For: 0.8-dev
>
>         Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks but several links to itself (e.g. it has a set of anchors), its score is distributed to its outlinks. But all of these outlinks point back to the page itself, which causes the page score to be doubled.
> I'm not sure, but this may also cause a never-ending fetching loop for the page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107.
> So the status "fetched" may be overwritten with "unfetched". In such a case we fetch the page again every time and also double its score every time, which leads to very high scores without any reason.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
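The mechanism described above can be sketched in a few lines. This is a minimal illustration, not Nutch's actual normalizer or scoring code: it assumes a normalizer that simply strips the URL fragment, and a scoring step that splits a page's score evenly across its outlinks. With only in-page anchors as outlinks, every normalized outlink equals the page itself, so the full score flows back and the page's score doubles.

```java
import java.util.List;

public class SelfLinkScoreDemo {
    // Strip the fragment, mirroring what a URL normalizer does to in-page anchors.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    // Sum the score that flows back to the page from outlinks that
    // normalize to the page itself (score split evenly across outlinks).
    static float contribution(String page, List<String> outlinks, float score) {
        float back = 0f;
        for (String link : outlinks) {
            if (normalize(link).equals(page)) {
                back += score / outlinks.size();
            }
        }
        return back;
    }

    public static void main(String[] args) {
        String page = "http://foo.com/a.html";
        // The page's only "outlinks" are its own internal anchors.
        List<String> outlinks = List.of(
            "http://foo.com/a.html#chapter1",
            "http://foo.com/a.html#chapter2");
        float score = 1.0f;
        // Every outlink normalizes back to the page, so the entire score
        // returns and is added on top of the existing score.
        float newScore = score + contribution(page, outlinks, score);
        System.out.println(newScore); // 2.0
    }
}
```

Repeating this on every crawl-update cycle is what produces the unbounded score growth the report warns about.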
[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.
doubling score causes by page internal anchors.
-----------------------------------------------

                 Key: NUTCH-332
                 URL: http://issues.apache.org/jira/browse/NUTCH-332
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8-dev
            Reporter: Stefan Groschupf
            Priority: Blocker
             Fix For: 0.8-dev


When a page has no outlinks but several links to itself (e.g. it has a set of anchors), its score is distributed to its outlinks. But all of these outlinks point back to the page itself, which causes the page score to be doubled.
I'm not sure, but this may also cause a never-ending fetching loop for the page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107.
So the status "fetched" may be overwritten with "unfetched". In such a case we fetch the page again every time and also double its score every time, which leads to very high scores without any reason.
[jira] Created: (NUTCH-331) Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
Fetcher incorrectly reports task progress to tasktracker resulting in skipped URLs
----------------------------------------------------------------------------------

                 Key: NUTCH-331
                 URL: http://issues.apache.org/jira/browse/NUTCH-331
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8-dev, 0.9-dev
            Reporter: Andrzej Bialecki
            Priority: Critical
             Fix For: 0.8-dev, 0.9-dev


Each Fetcher task starts multiple FetcherThreads, which consume the input fetchlist. These threads may block for a long time after being started and after reading their input fetchlist entries, due to "politeness" settings. However, the map-reduce framework considers a task complete when all of its input data has been read. This causes the tasktracker to incorrectly assume that task processing is complete (because the task progress is 1.0, since all input has been consumed), whereas many URLs from the fetchlist may still be waiting to be fetched in blocked threads. The more threads are used, the more apparent this problem becomes, because the final number of fetched pages may fall short of the target by as many as (numThreads * numMapTasks) entries. The end result is that only part of the fetchlist is fetched, because Fetcher map tasks are stopped when their progress reaches 1.0.
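The gap between "input consumed" and "work done" can be sketched as follows. This is a simplified illustration, not Nutch's or Hadoop's actual code, and the method names are hypothetical: it contrasts progress computed from fetchlist entries read (which reaches 1.0 while politeness-blocked threads still hold unfetched URLs) with progress computed from entries actually fetched.

```java
public class FetcherProgress {
    // Progress as the map-reduce framework sees it: fraction of input read.
    // This hits 1.0 as soon as the fetchlist is consumed, even though
    // FetcherThreads may still be blocked on politeness delays.
    static float naiveProgress(int entriesRead, int totalEntries) {
        return (float) entriesRead / totalEntries;
    }

    // A fetch-aware alternative: count an entry as done only once it has
    // actually been fetched, so the task is not considered complete while
    // URLs are still queued in blocked threads.
    static float fetchAwareProgress(int entriesFetched, int totalEntries) {
        return (float) entriesFetched / totalEntries;
    }

    public static void main(String[] args) {
        int total = 100;
        int read = 100;    // whole fetchlist has been handed to threads
        int fetched = 60;  // but 40 URLs are still waiting in blocked threads
        System.out.println(naiveProgress(read, total));       // 1.0 -> task looks done
        System.out.println(fetchAwareProgress(fetched, total)); // 0.6 -> task kept alive
    }
}
```

Under the naive scheme, stopping the task at progress 1.0 silently drops the 40 pending URLs, which is exactly the (numThreads * numMapTasks) shortfall described above.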
Re: 0.8 release
No objections from me. We have waited long enough, and we can fix things in a maintenance release in a few weeks.

Regards,
Piotr

On 7/26/06, Sami Siren <[EMAIL PROTECTED]> wrote:
Andrzej Bialecki wrote:
> Sami Siren wrote:
>
>> There is a package available for testing at
>> http://people.apache.org/~siren/nutch-0.8/
>>
>> Please give it some testing and post your opinion - is it good
>> enough to be a public release?
>>
>> I have some doubts because of NUTCH-266, but so far only 3 people
>> have reported this to be a problem
>> (me included)
>
>
> This is, I guess, related to a very specific environment - multiple
> nodes running on Cygwin. Usually people run multiple nodes on some
> flavor of Unix.
>
> I don't have any means to test it for this issue ...
>
The bug also appears in a single-node configuration, but I think it is
not that common (guessing from the number of people who have reported
it). However, that is now fixed in hadoop trunk. Should we use a patched
version of hadoop-0.4.0 in Nutch or wait for 0.5 (which at least still
seems to be 1.4 compatible)?

The 0.8 package has now hit the mirrors; does anybody have any objections
to announcing it? Stefan already commented on two issues he wished to be
fixed in 0.8, but to me it looks like they can both be addressed with
configuration changes and documentation in the first place, and there's
nothing stopping us from releasing 0.8.1 in a very short time to address
the issues discovered in 0.8.
--
Sami Siren