[ https://issues.apache.org/jira/browse/NUTCH-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2075:
-----------------------------------
    Fix Version/s: 2.5

> Generate will not choose URL without distance marker
> ----------------------------------------------------
>
>                 Key: NUTCH-2075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2075
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3
>         Environment: Using HBase as back-end Storage
>            Reporter: Alexandre Demeyer
>            Priority: Minor
>              Labels: newbie, patch, performance
>             Fix For: 2.5
>
>
> It appears that there is a bug affecting certain links where Nutch erases all
> markers: not only the inject, generate, fetch, parse and update markers but
> also the distance marker.
> The problem is that the Nutch Generator doesn't check the validity of the
> distance marker (i.e. whether it is null) and keeps invalid links (those
> without the distance marker) in the GeneratorMapper. When the distance filter
> is activated, GeneratorMapper also selects URLs without markers, so it doesn't
> respect the limit.
> I think this is related to the problem mentioned here:
> [NUTCH-1930|https://issues.apache.org/jira/browse/NUTCH-1930].
> This patch doesn't solve the underlying problem, which is that all markers are
> erased (apparently without any reason). But it does allow the crawl to stop.
> To let the crawl stop despite problematic URLs, I propose this solution, which
> is simply to skip the URL when the distance marker is null.
> (Sorry for putting the code here)
> {code:title=crawl/GeneratorMapper.java (initial code)|borderStyle=solid}
> // filter on distance
> if (maxDistance > -1) {
>   CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
>   if (distanceUtf8 != null) {
>     int distance = Integer.parseInt(distanceUtf8.toString());
>     if (distance > maxDistance) {
>       return;
>     }
>   }
> }
> {code}
> {code:title=crawl/GeneratorMapper.java (patch code)|borderStyle=solid}
> // filter on distance
> if (maxDistance > -1) {
>   CharSequence distanceUtf8 = page.getMarkers().get(DbUpdaterJob.DISTANCE);
>   if (distanceUtf8 != null) {
>     int distance = Integer.parseInt(distanceUtf8.toString());
>     if (distance > maxDistance) {
>       return;
>     }
>   } else {
>     // No distance marker, URL problem
>     return;
>   }
> }
> {code}
> Example of a link where the problem appears (set http.content.limit higher
> than the content length of the PDF):
> http://www.annales.org/archives/x/marchal2.pdf
> Hope it helps...
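
The patched decision can also be exercised outside of Nutch. The following is only a minimal sketch, not the actual GeneratorMapper code: the class name `DistanceFilterSketch`, the method `shouldSkip`, and the plain `Map<String, String>` standing in for `page.getMarkers()` are all hypothetical, and the `DISTANCE` constant is a placeholder for `DbUpdaterJob.DISTANCE`. It assumes, as in the quoted snippets, that a `maxDistance` of -1 or lower disables the filter.

```java
import java.util.HashMap;
import java.util.Map;

/** Standalone sketch of the patched distance filter; not the real Nutch class. */
class DistanceFilterSketch {

  // Hypothetical placeholder for DbUpdaterJob.DISTANCE (assumption of this sketch).
  static final String DISTANCE = "dist";

  /**
   * Returns true when the generator should skip the page.
   * The markers map is a simplified stand-in for page.getMarkers().
   */
  static boolean shouldSkip(Map<String, String> markers, int maxDistance) {
    if (maxDistance <= -1) {
      return false; // distance filter disabled
    }
    String distanceValue = markers.get(DISTANCE);
    if (distanceValue == null) {
      // Patched behavior: without a distance marker the limit cannot be
      // checked, so the URL is skipped instead of being generated.
      return true;
    }
    // Same comparison as the original code: skip pages beyond the limit.
    return Integer.parseInt(distanceValue) > maxDistance;
  }

  public static void main(String[] args) {
    Map<String, String> markers = new HashMap<>();
    System.out.println(shouldSkip(markers, 3)); // marker missing
    markers.put(DISTANCE, "2");
    System.out.println(shouldSkip(markers, 3)); // marker within the limit
  }
}
```

Pulling the check into a side-effect-free predicate like this makes the three cases (filter disabled, marker missing, marker present) easy to test separately. Like the quoted patch, the sketch does not handle a marker value that is not a valid integer.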