[ https://issues.apache.org/jira/browse/NUTCH-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2075:
-----------------------------------
    Fix Version/s: 2.5

> Generate will not choose URL without distance marker
> ----------------------------------------------------
>
>                 Key: NUTCH-2075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2075
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3
>         Environment: Using HBase as back-end Storage
>            Reporter: Alexandre Demeyer
>            Priority: Minor
>              Labels: newbie, patch, performance
>             Fix For: 2.5
>
>
> For certain links, Nutch erases all markers: not only the inject, generate,
> fetch, parse and update markers, but also the distance marker.
> The Nutch Generator does not check whether the distance marker is null, so
> GeneratorMapper keeps these links (the ones without a distance marker). When
> the distance filter is enabled, GeneratorMapper therefore also selects URLs
> that have no markers, and the distance limit is not respected.
> I think this is related to the problem mentioned in
> [NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].
> This patch does not solve the underlying problem, which is that all markers
> are erased (apparently without any reason), but it does allow the crawl to
> stop.
> To let the crawl stop despite these problematic URLs, I propose simply
> skipping a URL when its distance marker is null.
> (Sorry for putting the code here.)
> {code:title=crawl/GeneratorMapper.java (initial code)|borderStyle=solid}
> // filter on distance
>     if (maxDistance > -1) {
>       CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
>       if (distanceUtf8 != null) {
>         int distance = Integer.parseInt(distanceUtf8.toString());
>         if (distance > maxDistance) {
>           return;
>         }
>       }
>     }
> {code}
> {code:title=crawl/GeneratorMapper.java (patch code)|borderStyle=solid}
> // filter on distance
>     if (maxDistance > -1) {
>       CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
>       if (distanceUtf8 != null) {
>         int distance = Integer.parseInt(distanceUtf8.toString());
>         if (distance > maxDistance) {
>           return;
>         }
>       } else {
>         // No distance marker, URL problem
>         return;
>       }
>     }
> {code}
> Example of a link where the problem appears (set http.content.limit higher
> than the PDF's Content-Length):
> http://www.annales.org/archives/x/marchal2.pdf
> Hope this helps.
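
For readers who want to try the proposed rule outside of Nutch, here is a
minimal standalone sketch of the patched check. It is not the actual
GeneratorMapper code: the class name DistanceFilterSketch, the method
shouldSkip, the plain String map and the marker key "dist" are all
assumptions made for illustration. Only the decision logic (filter disabled
when maxDistance is -1 or lower, skip when the marker is missing, skip when
the distance exceeds the limit) follows the patch quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the distance check in GeneratorMapper.
final class DistanceFilterSketch {

    // Assumed marker key; in Nutch the key comes from DbUpdaterJob.DISTANCE.
    static final String DISTANCE = "dist";

    // Returns true when the URL should be skipped by the generator.
    static boolean shouldSkip(Map<String, String> markers, int maxDistance) {
        if (maxDistance <= -1) {
            return false; // distance filter disabled, keep the URL
        }
        String raw = markers.get(DISTANCE);
        if (raw == null) {
            return true; // patched behaviour: no distance marker, skip the URL
        }
        // Same comparison as the original code: skip only beyond the limit.
        return Integer.parseInt(raw.trim()) > maxDistance;
    }

    public static void main(String[] args) {
        Map<String, String> markers = new HashMap<>();
        System.out.println("no marker, limit 3:  skip=" + shouldSkip(markers, 3));
        markers.put(DISTANCE, "2");
        System.out.println("distance 2, limit 3: skip=" + shouldSkip(markers, 3));
        markers.put(DISTANCE, "4");
        System.out.println("distance 4, limit 3: skip=" + shouldSkip(markers, 3));
    }
}
```

With the original code the first case (no marker) would fall through and the
URL would be generated; returning early for a null marker is the only
behavioural change the patch makes.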



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
